2601.06794v1 Jan 11, 2026 cs.AI

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

Zhicong Li

Citations: 47

h-index: 3

Yulan Hu

Citations: 74

h-index: 5

Yixia Li

Southern University of Science and Technology

Citations: 190

h-index: 7

Xiangwen Zhang

Citations: 11

h-index: 2

Guanhua Chen

Citations: 65

h-index: 3

Zheng Pan

Citations: 53

h-index: 3

Xin Li

Citations: 16

h-index: 2

Yong Liu

Citations: 13

h-index: 2

Lingjie Jiang

Citations: 78

h-index: 3

Xingchen Zeng

Hong Kong University of Science and Technology (Guangzhou)

Citations: 94

h-index: 4

Original Abstract

Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent's error patterns shift over time, causing stationary critics to become stale and provide feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation. We address the challenge of learning plateaus via a saturation-aware gain shaping objective, which rewards the critic for inducing incremental improvements in high-performing trajectories. By employing dual-track GRPO updates, ECHO ensures the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.
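
As a rough illustration of the group-structured advantage estimation described in the abstract, the sketch below applies the standard GRPO-style group normalization to the outcomes of several critic diagnoses for one initial trajectory. The function names, the scalar scores, and the use of the raw score improvement as the critic's reward are assumptions made for illustration only. The abstract does not specify the saturation-aware gain shaping, so it is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def critic_rewards(initial_score, refined_scores):
    """Hypothetical critic reward: how much each diagnosis improved the score.

    Assumption: one refined trajectory per diagnosis, each with a scalar score.
    """
    return [score - initial_score for score in refined_scores]


# Made-up numbers: one initial trajectory, four diagnoses, four refinements.
initial_score = 0.4
refined_scores = [0.4, 0.6, 0.8, 0.6]

# Critic track: advantages over the improvements induced by each diagnosis.
critic_adv = group_relative_advantages(critic_rewards(initial_score, refined_scores))
# Policy track: advantages over the refined trajectories' own scores.
policy_adv = group_relative_advantages(refined_scores)
```

In this simplified setup each diagnosis is credited only relative to its sibling diagnoses for the same initial trajectory, which is what makes the group structure useful. How ECHO actually shapes the critic's reward near saturation is left to the paper.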

2 Citations

0 Influential

3.5 Altmetric

19.5 Score
