2604.03098v1 Apr 03, 2026 cs.LG

Co-Evolution of Policy and Internal Reward for Language Agents

Tung Sum Thomas Kwok (Citations: 25, h-index: 2)
Chenglin Wu (Citations: 352, h-index: 7)
Yuyu Luo (Citations: 530, h-index: 8)
Jiayi Zhang (Citations: 920, h-index: 11)
Fanqi Kong (Citations: 51, h-index: 5)
Xinyu Wang (Citations: 54, h-index: 3)
Hanwei Wu (Citations: 0, h-index: 0)
Jingwei Song (Citations: 4, h-index: 1)
Shuyuan Zhang (Citations: 481, h-index: 4)
Xiaonan Chang (Citations: 0, h-index: 0)
Bang Liu (Citations: 4, h-index: 1)

Original Abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: a better policy produces better guidance, and better guidance, used as internal reward, further improves the policy. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
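
The abstract does not say how the self-guidance signal is turned into a number or how it is combined with the environment reward. The following Python sketch illustrates one plausible reading, under stated assumptions: each step carries a self-generated verdict that is mapped to a scalar through a hypothetical table (GUIDE_SCORES), the scaled score is used as a dense per-step reward, the sparse environment reward is added at the final step, and trajectory returns are standardized within a group of rollouts in the style of GRPO. The names Step, Trajectory, GUIDE_SCORES, and beta are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

# ASSUMPTION: a hypothetical mapping from a textual self-guidance verdict
# to a scalar score. The paper's actual conversion is not given in the abstract.
GUIDE_SCORES = {"good": 1.0, "neutral": 0.0, "bad": -1.0}


@dataclass
class Step:
    action: str
    guide: str  # self-generated verdict attached to this step


@dataclass
class Trajectory:
    steps: List[Step]
    env_reward: float  # sparse reward, observed only at the end of the episode


def step_rewards(traj: Trajectory, beta: float = 0.1) -> List[float]:
    """Dense per-step rewards: beta * internal score, plus the
    environment reward added to the final step (an assumed combination)."""
    rewards = [beta * GUIDE_SCORES[s.guide] for s in traj.steps]
    if rewards:
        rewards[-1] += traj.env_reward
    return rewards


def group_advantages(group: List[Trajectory], beta: float = 0.1,
                     eps: float = 1e-8) -> List[float]:
    """GRPO-style group-relative advantages: standardize each trajectory's
    total return against the other rollouts in the same group."""
    returns = [sum(step_rewards(t, beta)) for t in group]
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


if __name__ == "__main__":
    # Two rollouts that both fail the task (env_reward = 0) still receive
    # different advantages because their internal rewards differ.
    group = [
        Trajectory([Step("open drawer", "good"), Step("take key", "good")], 0.0),
        Trajectory([Step("open drawer", "bad"), Step("wait", "neutral")], 0.0),
    ]
    print(group_advantages(group))
```

The point of the sketch is the contrast with environment-only training: when every rollout in a group ends with the same sparse reward, the group-relative advantage is zero for all of them, whereas the added internal reward still separates better from worse trajectories.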

0 Citations

0 Influential

5.5 Altmetric

27.5 Score
