2604.14646v2 Apr 16, 2026 cs.AI

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

Lai Wei (Shanghai Jiao Tong University; citations: 208, h-index: 7)
Weiran Huang (citations: 88, h-index: 5)
Yanzhi Zhang (citations: 28, h-index: 2)
Chenyang Shao (citations: 8, h-index: 2)
Zedong Dan (citations: 5, h-index: 1)
GeGe Lan (citations: 0, h-index: 0)
Chen Wang (citations: 2,804, h-index: 6)
Yue Wang (citations: 92, h-index: 5)

Abstract

Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, which causes the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to keep optimization stable. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potentially valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@k. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring its role in scaling RL-based reasoning in large models. Our code is available at https://github.com/597358816/UEC-RL.
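The abstract relies on two quantities it does not define: policy entropy (whose collapse signals lost diversity) and Pass@k. The sketch below illustrates both under stated assumptions. It is not taken from the UEC-RL code. The function names `shannon_entropy` and `pass_at_k` are hypothetical, and Pass@k is computed with the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), which the paper may or may not use.

```python
import math


def shannon_entropy(probs):
    """Entropy (in nats) of one next-token distribution.

    A value near 0 means the policy puts almost all mass on a single
    token, which is what "entropy collapse" looks like at one step.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n sampled answers, c of them correct.

    Assumption: the standard estimator 1 - C(n-c, k) / C(n, k); the
    paper's exact evaluation protocol is not given in the abstract.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong answers: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Illustrative distributions only, not measurements from the paper.
    diverse = [0.25, 0.25, 0.25, 0.25]
    collapsed = [0.97, 0.01, 0.01, 0.01]
    print("entropy (diverse):  ", shannon_entropy(diverse))
    print("entropy (collapsed):", shannon_entropy(collapsed))
    print("pass@1 with 3/10 correct:", pass_at_k(10, 3, 1))
    print("pass@5 with 3/10 correct:", pass_at_k(10, 3, 5))
```

A policy that has collapsed tends to sample near-identical answers, so Pass@k grows little as k increases. That is one reason the abstract reports both Pass@1 and Pass@k.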
