2604.21725v1 Apr 23, 2026 cs.CL

AEL: Agent Evolving Learning for Open-Ended Environments

Dimitris N. Metaxas

Citations: 422

h-index: 11

K. Mei

Citations: 1,966

h-index: 14

Wujiang Xu

Citations: 124

h-index: 7

Jiaojiao Han

Citations: 60

h-index: 3

Minghao Guo

Citations: 52

h-index: 3

Xi Zhu

Citations: 27

h-index: 4

Han Zhang

Citations: 30

h-index: 3

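LLM agents now operate in open-ended environments made up of hundreds of sequential episodes, yet most of them remain stateless: each task starts from scratch instead of using past experience to produce better future behavior. The core problem is not what to remember but how to use what has been remembered, which includes deciding which retrieval policy to apply, how to interpret earlier outcomes, and when the current strategy itself needs to change. This work introduces Agent Evolving Learning (AEL), a two-timescale framework for this problem. At the fast timescale, Thompson Sampling learns which memory retrieval policy to apply in each episode. At the slow timescale, an LLM-driven reflection process diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving the agent a frame for interpreting the evidence it retrieves. On a sequential portfolio benchmark using tickers from 10 different sectors (208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five existing self-improvement methods and all non-LLM baselines while keeping the lowest variance among all LLM-based approaches. An ablation over nine variants reveals a "less is more" pattern: memory and reflection together yield a cumulative 58% improvement over the stateless baseline, but every additional mechanism tested (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This indicates that the bottleneck in agent self-improvement lies in the ability to self-diagnose how to use experience, not in adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.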
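The fast-timescale step, a Thompson Sampling bandit choosing a memory retrieval policy per episode, can be illustrated with a minimal Beta-Bernoulli sketch. This is not the authors' implementation. The policy names, the binary per-episode reward, the uniform Beta(1, 1) prior, and the simulated success rates are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random


class ThompsonPolicySelector:
    """Beta-Bernoulli Thompson Sampling over a fixed set of retrieval policies.

    Assumption: each episode yields a binary reward (1 = good outcome, 0 = bad).
    """

    def __init__(self, policies, seed=None):
        self.policies = list(policies)
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, i.e. a uniform prior on the success rate.
        self.alpha = {p: 1.0 for p in self.policies}
        self.beta = {p: 1.0 for p in self.policies}

    def select(self):
        # Draw one posterior sample per policy and pick the largest draw.
        samples = {
            p: self.rng.betavariate(self.alpha[p], self.beta[p])
            for p in self.policies
        }
        return max(samples, key=samples.get)

    def update(self, policy, reward):
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1 in this sketch")
        # A success increments alpha, a failure increments beta.
        self.alpha[policy] += reward
        self.beta[policy] += 1 - reward


def simulate(true_success, episodes=208, seed=0):
    """Run the selector against hypothetical per-policy success rates."""
    selector = ThompsonPolicySelector(true_success, seed=seed)
    env_rng = random.Random(seed + 1)
    pulls = {p: 0 for p in true_success}
    for _ in range(episodes):
        policy = selector.select()
        reward = 1 if env_rng.random() < true_success[policy] else 0
        selector.update(policy, reward)
        pulls[policy] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical policy names and success rates, for illustration only.
    print(simulate({"recency": 0.3, "similarity": 0.6, "outcome_weighted": 0.5}))
```

Because each arm is chosen in proportion to the posterior probability that it is the best one, exploration tapers off on its own as evidence accumulates, with no separate exploration schedule to tune.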

Original Abstract

LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a "less is more" pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This demonstrates that the bottleneck in agent self-improvement is self-diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.
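For readers unfamiliar with the headline metric, the sketch below shows one common way to compute a Sharpe ratio from per-period returns and to summarize it across seeds as mean ± standard deviation. The abstract does not state the return frequency, the risk-free rate, the annualization factor, or whether the ± is a sample standard deviation, so the defaults here (zero risk-free rate, 252 periods per year, sample standard deviation) are assumptions, and the function names are hypothetical.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample std.

    Assumptions: `returns` are simple per-period returns, and
    `risk_free_rate` is expressed per period.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    excess = [r - risk_free_rate for r in returns]
    std = statistics.stdev(excess)  # sample standard deviation (n - 1)
    if std == 0:
        raise ValueError("returns have zero variance; Sharpe is undefined")
    return statistics.mean(excess) / std * math.sqrt(periods_per_year)


def summarize_across_seeds(values):
    """Return (mean, sample std) of a metric over independent seeds."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Made-up returns, for illustration only; not data from the paper.
    example_returns = [0.004, -0.002, 0.006, 0.001, -0.003, 0.005]
    print(sharpe_ratio(example_returns))
```

Reporting the standard deviation across seeds alongside the mean is what makes the abstract's "lowest variance" comparison meaningful, since a high mean Sharpe ratio from a single seed can be a lucky run.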

3 Citations

0 Influential

27 Altmetric

138.0 Score
