2605.00425v1 May 01, 2026 cs.AI

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Hao-Dong Zhao (citations: 1,261; h-index: 3)

Daxiang Dong (citations: 27; h-index: 3)

Lun Tian (citations: 1; h-index: 1)

Tianshun Zhu (citations: 2; h-index: 1)

Jianmin Wu (citations: 20; h-index: 2)

Wenyu Zhang (citations: 267; h-index: 7)

Yuchen Zeng (citations: 80; h-index: 4)

Songlin Zhou (citations: 129; h-index: 3)

Yuxin Zhang (citations: 38; h-index: 2)

Stephen S.-T. Yau (citations: 0; h-index: 0)

Yifeng Huang (citations: 6; h-index: 1)

Jing Gu (citations: 25; h-index: 3)

Abstract

Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to individual steps in an agent's action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this increases supervision and tuning complexity and often generalizes poorly across tasks and domains. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration-exploitation trade-off. Theoretically, we elevate entropy analysis from the token level to the response level to reduce token sampling variance and show that entropy drift under natural gradients is intrinsically governed by the product of the advantage and the relative response surprisal. Specifically, we derive a practical proxy to reshape training dynamics, enabling a natural transition from exploration to exploitation. Extensive experiments across various benchmarks and models ranging from 1.5B to 32B parameters demonstrate the effectiveness of AEM, including a notable 1.4 percent gain when integrated into a state-of-the-art baseline on the highly challenging SWE-bench-Verified benchmark.
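The abstract's claim that entropy drift is governed by the product of the advantage and the relative response surprisal can be illustrated on a toy problem. The sketch below is not the paper's derivation or its practical proxy. It assumes a single-step tabular softmax policy over four candidate responses, treats a natural-gradient step as adding `lr * advantage` to each logit, and defines relative surprisal as `-log p(y) - H`. All function names are hypothetical. Under those assumptions, the first-order entropy change is `lr * E[A(y) * (-log p(y) - H)]`, which the code compares against the exact change.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_entropy_drift(logits, advantages, lr):
    """First-order estimate: lr * E_p[A(y) * (surprisal(y) - H)]."""
    probs = softmax(logits)
    h = entropy(probs)
    return lr * sum(
        p * a * (-math.log(p) - h) for p, a in zip(probs, advantages)
    )


def actual_entropy_drift(logits, advantages, lr):
    """Exact entropy change after the assumed update: logit += lr * A."""
    new_logits = [z + lr * a for z, a in zip(logits, advantages)]
    return entropy(softmax(new_logits)) - entropy(softmax(logits))


# Toy policy: response 0 is the most likely, response 2 the least likely.
logits = [2.0, 0.5, -1.0, 0.0]
lr = 1e-3

# Advantage rewards the rare response and penalizes the common one.
rare_good = [-1.0, 0.0, 1.0, 0.0]
# Advantage rewards the common response and penalizes the rare one.
common_good = [1.0, 0.0, -1.0, 0.0]

for name, adv in [("rare_good", rare_good), ("common_good", common_good)]:
    pred = predicted_entropy_drift(logits, adv, lr)
    real = actual_entropy_drift(logits, adv, lr)
    print(f"{name}: predicted={pred:.3e} actual={real:.3e}")
```

In this toy setup, entropy rises when the advantage favors the least likely (high-surprisal) response and falls when it favors the most likely one, which is the intuition behind modulating entropy through the advantage signal. How AEM turns this relationship into a training-time proxy for multi-turn trajectories is specified in the paper itself, not here.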
