2602.05165v3 Feb 05, 2026 cs.LG

EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

Yuhang Zhou (Citations: 8, h-index: 2)

Lizhu Zhang (Citations: 128, h-index: 6)

Kevin Han (Citations: 49, h-index: 5)

Mingze Gao (Citations: 19, h-index: 2)

Gedi Zhou (Citations: 2, h-index: 1)

Serena Li (Citations: 7, h-index: 2)

Abhishek Kumar (Citations: 2, h-index: 1)

Xiangjun Fan (Citations: 98, h-index: 5)

Weiwei Li (Citations: 23, h-index: 3)

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy's accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford's online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
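
The abstract does not state EBPO's formulas, so the following is only a minimal sketch of the general idea it describes: a group baseline shrunk toward a global prior whose statistics are maintained with Welford's online algorithm. The class and function names, the pseudo-count `kappa`, and the specific shrinkage weight `n / (n + kappa)` are assumptions chosen for illustration, not the paper's estimator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningStats:
    """Welford's online algorithm for a running mean and variance."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; defined as 0.0 until two samples are seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def shrunk_advantages(rewards: List[float], prior: RunningStats,
                      kappa: float = 4.0) -> List[float]:
    """Advantages against a baseline shrunk toward the global mean.

    ASSUMPTION: the weight n / (n + kappa) is a generic shrinkage form,
    with kappa a hypothetical prior pseudo-count. It is not taken from
    the paper.
    """
    n = len(rewards)
    local_mean = sum(rewards) / n
    if prior.count == 0:
        baseline = local_mean  # no prior yet: fall back to the group mean
    else:
        weight = n / (n + kappa)
        baseline = weight * local_mean + (1.0 - weight) * prior.mean
    return [r - baseline for r in rewards]


# Global prior built from earlier rollouts (illustrative rewards).
prior = RunningStats()
for r in [1.0, 0.0, 1.0, 0.0]:
    prior.update(r)

# Saturated failure group: every response earns zero reward.
failed_group = [0.0, 0.0, 0.0, 0.0]
advantages = shrunk_advantages(failed_group, prior, kappa=4.0)

# A purely group-relative baseline gives zero advantage for this group.
group_mean = sum(failed_group) / len(failed_group)
group_relative = [r - group_mean for r in failed_group]
```

In this sketch the all-zero group receives a negative advantage whenever the global mean is positive, whereas subtracting only the group mean yields zeros. That mirrors the non-vanishing penalty signal the abstract claims, under the stated assumptions.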
