2603.08561v3 Mar 09, 2026 cs.AI

RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Zi-Yan Liu (Citations: 1,319; h-index: 12)

Xiaoying Zhang (Citations: 72; h-index: 3)

Yipeng Zhang (Citations: 3; h-index: 1)

Xia Hu (Citations: 9; h-index: 2)

Wenqi Shao (Citations: 4; h-index: 1)

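Standard reinforcement learning (RL) for large language model (LLM)-based agents typically optimizes extrinsic task-success rewards, prioritizing one-off task solving over continual adaptation. As a result, agents may converge to suboptimal policies because of limited exploration, and accumulated experience is stored only implicitly in model parameters, which hinders efficient learning from experience. Inspired by the human capacity for retrospective self-improvement, we introduce RetroAgent, an online RL framework. RetroAgent lets agents not only solve tasks in complex interactive environments but also evolve under the joint guidance of extrinsic task-success rewards and retrospective dual intrinsic feedback. Specifically, RetroAgent includes a hindsight self-reflection mechanism that produces (1) intrinsic numerical feedback, which rewards promising exploration by tracking incremental subtask completion relative to prior attempts, and (2) intrinsic language feedback, which stores reusable lessons in a memory buffer and retrieves them with the proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy, balancing relevance, utility, and exploration. Extensive experiments on four challenging agentic tasks show that RetroAgent achieves state-of-the-art (SOTA) performance, substantially outperforming RL fine-tuning, memory-augmented RL, exploration-guided RL, and meta-RL methods. For example, it improves over GRPO-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper, while maintaining strong test-time adaptation and generalization.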

Original Abstract

Standard reinforcement learning (RL) for large language model (LLM)-based agents typically optimizes extrinsic task-success rewards, prioritizing one-off task solving over continual adaptation. As a result, agents may converge to suboptimal policies due to limited exploration, and accumulated experience remains implicitly stored in model parameters, hindering efficient experiential learning. Inspired by humans' capacity for retrospective self-improvement, we introduce RetroAgent, an online RL framework that enables agents to master complex interactive environments not only by solving, but also by evolving under the joint guidance of extrinsic task-success rewards and retrospective dual intrinsic feedback. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces: (1) intrinsic numerical feedback, which tracks incremental subtask completion relative to prior attempts to reward promising exploration; and (2) intrinsic language feedback, which distills reusable lessons into a memory buffer retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy, jointly balancing relevance, utility, and exploration. Extensive experiments across four challenging agentic tasks show that RetroAgent achieves state-of-the-art (SOTA) performance, substantially outperforming RL fine-tuning, memory-augmented RL, exploration-guided RL, and meta-RL methods -- e.g., exceeding Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while maintaining strong test-time adaptation and out-of-distribution generalization.
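The abstract names two mechanisms without giving their formulas: a numerical signal that tracks subtask progress relative to prior attempts, and a retrieval rule that balances relevance, utility, and exploration. The Python sketch below shows one plausible reading of each, purely as an illustration. The `Lesson` fields, the additive scoring form, the weights `w_sim`, `w_util`, and `c`, and the UCB1-style bonus are all assumptions made here, not the paper's SimUtil-UCB definition.

```python
import math
from dataclasses import dataclass


@dataclass
class Lesson:
    """One stored lesson; all fields are hypothetical, for illustration only."""
    text: str
    similarity: float      # relevance to the current task, assumed in [0, 1]
    utility: float = 0.0   # running mean of outcomes observed after using the lesson
    uses: int = 0          # how many times the lesson has been retrieved


def progress_reward(completed_now: int, best_completed_before: int) -> float:
    """Assumed form of the numerical feedback: reward only the subtasks
    completed beyond the best prior attempt on the same task."""
    return float(max(0, completed_now - best_completed_before))


def ucb_score(lesson: Lesson, total_retrievals: int,
              w_sim: float = 1.0, w_util: float = 1.0, c: float = 1.0) -> float:
    """UCB1-style score: relevance + utility + exploration bonus.
    The additive form and the weights are assumptions, not the paper's formula."""
    bonus = c * math.sqrt(math.log(total_retrievals + 1) / (lesson.uses + 1))
    return w_sim * lesson.similarity + w_util * lesson.utility + bonus


def retrieve(memory: list[Lesson], total_retrievals: int, k: int = 1) -> list[Lesson]:
    """Return the k highest-scoring lessons and count them as used."""
    ranked = sorted(memory, key=lambda l: ucb_score(l, total_retrievals), reverse=True)
    chosen = ranked[:k]
    for lesson in chosen:
        lesson.uses += 1
    return chosen


def update_utility(lesson: Lesson, outcome: float) -> None:
    """Incremental mean over the outcomes seen so far (call after retrieve)."""
    lesson.utility += (outcome - lesson.utility) / max(lesson.uses, 1)
```

In this sketch a rarely used lesson can outrank a slightly more similar but heavily used one, because its exploration bonus shrinks more slowly. That is the usual motivation for an upper-confidence-bound rule, and whether RetroAgent weights the three terms this way is not stated in the abstract.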

2 Citations

0 Influential

6 Altmetric

32.0 Score
