2603.04918v1 Mar 05, 2026 cs.LG

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Xipeng Qiu

Citations: 8

h-index: 2

Zhangyue Yin

Fudan University

Citations: 3,857

h-index: 19

Yufei Gao

Citations: 36

h-index: 2

Yuan Li

Citations: 146

h-index: 2

Boyu Wang

Citations: 7

h-index: 2

Yuqian Yao

Citations: 2

h-index: 1

Xinyuan Wang

Citations: 182

h-index: 7

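The stability of reinforcement learning for large language models rests on proximal constraints. The clipping mechanism commonly used in PPO is an efficient substitute for trust regions, but it has a significant drawback: its fixed bounds overly restrict the update range of low-probability actions, which suppresses high-reward strategies and causes rapid entropy collapse. To address this, we propose Band-constrained Policy Optimization (BandPO). BandPO replaces conventional clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis shows that Band resolves this exploration bottleneck. The mapping is formulated as a convex optimization problem, which guarantees a globally optimal numerical solution, and closed-form solutions are derived for specific divergences. Extensive experiments across a range of models and datasets show that BandPO consistently outperforms both conventional clipping and Clip-Higher while mitigating entropy collapse.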

Original Abstract

Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
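To make the idea of a probability-aware clipping interval concrete, the sketch below derives ratio bounds from a divergence budget for a single action. It is an illustration under stated assumptions, not the paper's Band operator: the abstract does not give the construction, so the code assumes a KL trust region of the form KL(old || new) <= delta, reduces it to the two-outcome case (the chosen action versus everything else), and solves for the boundary by bisection. The names `bernoulli_kl`, `kl_ratio_interval`, and `delta` are hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for p in (0, 1); infinite if q is 0 or 1."""
    if q <= 0.0 or q >= 1.0:
        return math.inf
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ratio_interval(p, delta, iters=200):
    """Return (low, high) bounds on the ratio q / p such that
    KL(Bernoulli(p) || Bernoulli(q)) <= delta.

    Assumption for illustration only: the trust region is a KL ball around
    the old policy, collapsed to the binary event "this action or not".
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if delta <= 0.0:
        raise ValueError("delta must be positive")

    # Upper boundary: on (p, 1) the divergence grows as q moves away from p.
    lo, hi = p, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(p, mid) > delta:
            hi = mid
        else:
            lo = mid
    q_high = lo

    # Lower boundary: on (0, p) the divergence grows as q shrinks toward 0.
    lo, hi = 0.0, p
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(p, mid) > delta:
            lo = mid
        else:
            hi = mid
    q_low = hi

    return q_low / p, q_high / p


if __name__ == "__main__":
    eps = 0.2  # a fixed PPO-style clip range, shown for comparison
    for p_old in (0.01, 0.1, 0.5, 0.9):
        low, high = kl_ratio_interval(p_old, delta=0.05)
        print(f"p_old={p_old:<5} band=[{low:.3f}, {high:.3f}]  "
              f"fixed=[{1 - eps:.3f}, {1 + eps:.3f}]")
```

Under these assumptions the upper ratio bound widens as the old probability shrinks, whereas a fixed range of 1 ± 0.2 lets an action with probability 0.01 rise only to 0.012 in one update. That is the bottleneck the abstract describes. Other f-divergences, or the opposite KL direction, would give different intervals.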

0 Citations

0 Influential

9.5 Altmetric

47.5 Score

