2602.05494v2 Feb 05, 2026 cs.LG

A Unified Framework for Rethinking Policy Divergence Measures in GRPO

Qingyuan Wu (Citations: 50, h-index: 4)
Yuhui Wang (Citations: 31, h-index: 4)
S. Zhan (Citations: 149, h-index: 5)
Yan Dai (Citations: 29, h-index: 2)
Shilong Deng (Citations: 8, h-index: 2)
Sarra Habchi (Citations: 373, h-index: 10)
Qi Zhu (Citations: 121, h-index: 5)
Chao Huang (Citations: 13, h-index: 3)
Matthias Gallé (Citations: 126, h-index: 4)

Abstract

Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.
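The claimed equivalence between a KL3-based constraint and asymmetric ratio clipping can be illustrated numerically. The sketch below is not the authors' implementation. It assumes the commonly used k3 form of the estimator, k3(r) = (r - 1) - log r, where r is the per-token likelihood ratio between the new and old policies; the direction of the ratio, the threshold name `delta`, and the helper names are all assumptions made for illustration. Because k3 is convex with its minimum at r = 1, the set of ratios with k3(r) <= delta is an interval [lo, hi], and that interval extends further above 1 than below it.

```python
import math


def kl3(ratio: float) -> float:
    """Per-sample k3 estimate, (r - 1) - log r; non-negative, zero at r = 1."""
    return (ratio - 1.0) - math.log(ratio)


def _bisect(a: float, b: float, delta: float, tol: float = 1e-12) -> float:
    """Find r in [a, b] with kl3(r) = delta; kl3 - delta must change sign on [a, b]."""
    fa = kl3(a) - delta
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = kl3(m) - delta
        if (fm > 0) == (fa > 0):
            a, fa = m, fm  # root lies in the right half
        else:
            b = m          # root lies in the left half
    return 0.5 * (a + b)


def kl3_ratio_bounds(delta: float) -> tuple[float, float]:
    """Return (lo, hi) such that kl3(r) <= delta holds for lo <= r <= hi (delta > 0)."""
    # At r = exp(-delta - 1), kl3(r) = r + delta > delta, so it brackets the lower root.
    lo = _bisect(math.exp(-delta - 1.0), 1.0, delta)
    # Grow the upper bracket until kl3 exceeds delta.
    upper = 2.0
    while kl3(upper) <= delta:
        upper *= 2.0
    hi = _bisect(1.0, upper, delta)
    return lo, hi


if __name__ == "__main__":
    lo, hi = kl3_ratio_bounds(0.02)  # 0.02 is an arbitrary illustrative threshold
    print(f"ratio interval: [{lo:.4f}, {hi:.4f}]")
    print(f"room below 1: {1 - lo:.4f}, room above 1: {hi - 1:.4f}")
```

Under these assumptions, a single threshold on k3 behaves like a ratio clip whose upper bound is looser than its lower bound, which is the kind of asymmetry the abstract describes. How the paper maps this onto the GRPO objective is not specified in the abstract.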

