2604.07791v1 Apr 09, 2026 cs.AI

SEARL: Self-Evolving Agents through Joint Optimization of Policy and Tool Graph Memory

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

Jing Shao (Citations: 33, h-index: 2)

Xinshun Feng (Citations: 9, h-index: 2)

Xin Song (Citations: 37, h-index: 3)

Lijun Li (Citations: 1,464, h-index: 10)

Gongshen Liu (Citations: 702, h-index: 10)

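Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have shown considerable promise on single-turn reasoning tasks. As agentic learning shifts toward self-evolution, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experience. Existing methods, however, generally depend on large-scale language models or multi-agent frameworks, which makes deployment in resource-constrained environments difficult. The inherent sparsity of outcome-based rewards is a further challenge, because agents typically receive feedback only when a task is completed. To overcome these limitations, we propose SEARL, a self-evolving agent framework based on tool memory. Rather than using prior interaction experience directly, our method builds a structured experience memory that integrates planning and execution. This yields a new state abstraction that improves generalization across similar contexts, such as tool reuse. As a result, the agent extracts explicit knowledge from historical data while exploiting correlations between trajectories to densify the reward signal. We evaluate the framework on knowledge reasoning and mathematics tasks and show that it achieves more practical and efficient learning.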

Original Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.
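The abstract does not spell out how the tool memory is built or how rewards are densified, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy graph whose nodes are tool names and whose edges are consecutive tool calls, and it assumes that a step's shaping reward is the mean outcome of past trajectories that used the same edge. The class `ToolGraphMemory`, its methods, the `<start>` node, and the `weight` parameter are all hypothetical names introduced for illustration.

```python
from collections import defaultdict


class ToolGraphMemory:
    """Toy memory (assumption): nodes are tool names, edges are consecutive tool calls."""

    def __init__(self):
        # edge (src_tool, dst_tool) -> [visit count, sum of outcome rewards]
        self.edges = defaultdict(lambda: [0, 0.0])

    def add_trajectory(self, tools, outcome):
        """Record one finished trajectory together with its sparse outcome reward."""
        for edge in zip(["<start>"] + tools, tools):
            stats = self.edges[edge]
            stats[0] += 1
            stats[1] += outcome

    def edge_value(self, edge, default=0.0):
        """Mean outcome of the stored trajectories that traversed this edge."""
        if edge not in self.edges:
            return default
        count, total = self.edges[edge]
        return total / count

    def dense_rewards(self, tools, outcome, weight=0.5):
        """Per-step rewards: edge statistics at every step, outcome added at the last step."""
        steps = [weight * self.edge_value(e) for e in zip(["<start>"] + tools, tools)]
        if steps:
            steps[-1] += outcome
        return steps


memory = ToolGraphMemory()
memory.add_trajectory(["search", "calculator"], outcome=1.0)
memory.add_trajectory(["search", "summarize"], outcome=0.0)

# A new trajectory that reuses a stored tool sequence gets a signal at every step,
# not only at the end.
print(memory.dense_rewards(["search", "calculator"], outcome=1.0))
```

In this sketch, two trajectories that share the `search` step contribute to the same edge statistics, which is one simple way correlations across trajectories could turn a single end-of-task reward into per-step feedback.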

