2602.04879v1 Feb 04, 2026 cs.LG

Rethinking the Trust Region in LLM Reinforcement Learning: A New Approach to Proximal Policy Optimization (PPO)

Rethinking the Trust Region in LLM Reinforcement Learning

Penghui Qi

Citations: 1,209

h-index: 9

Xiangxin Zhou

Citations: 40

h-index: 3

Zi-Yan Liu

Citations: 1,319

h-index: 12

Tianyu Pang

Citations: 429

h-index: 11

Chao Du

Citations: 1,542

h-index: 14

Min Lin

Citations: 2,386

h-index: 21

Wee Sun Lee

Citations: 1,037

h-index: 6

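Reinforcement learning (RL) plays an important role in fine-tuning large language models (LLMs), and Proximal Policy Optimization (PPO) has become the de facto standard algorithm. However, this paper argues that PPO's core ratio-clipping mechanism is structurally ill-suited to the large vocabularies of LLMs. PPO constrains policy updates based on the probability ratio of the sampled token, which acts as a noisy single-sample Monte Carlo estimate of the true policy divergence. This produces suboptimal learning dynamics: updates to low-probability tokens are over-suppressed, while potentially catastrophic shifts in high-probability tokens are under-constrained, which hurts both training efficiency and stability. To address this, the paper proposes Divergence Proximal Policy Optimization (DPPO), which replaces heuristic clipping with a more principled constraint based on a direct estimate of the true policy divergence (e.g., total variation or KL divergence). To avoid a large memory footprint, it also introduces Binary and Top-K approximations that capture the essential divergence efficiently with negligible overhead. Extensive empirical evaluation shows that DPPO achieves better training stability and efficiency than existing methods, providing a more robust foundation for RL-based LLM fine-tuning.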

Original Abstract

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a huge memory footprint, we introduce efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
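The asymmetry described in the abstract can be illustrated with a small numerical sketch. This is not the paper's implementation: the function names, the clipping threshold `eps=0.2`, and the particular definitions used here are assumptions made for illustration. The "binary" estimate assumes the vocabulary is collapsed into two outcomes (the sampled token and everything else), and the "top-K" estimate assumes the K most likely tokens under the old policy plus one bucket for the remainder. Both are plain total variation computations on coarsened distributions, so each is a lower bound on the full-vocabulary total variation.

```python
import numpy as np


def ratio_clip_active(p_old, p_new, eps=0.2):
    """PPO-style check: is the sampled token's probability ratio
    outside the interval [1 - eps, 1 + eps]? (eps is an assumed value)"""
    ratio = p_new / p_old
    return bool(abs(ratio - 1.0) > eps)


def binary_tv(p_old, p_new):
    """Total variation between the two-outcome distributions
    (p, 1 - p) and (q, 1 - q), which simplifies to |p - q|."""
    return abs(p_new - p_old)


def topk_tv(old_probs, new_probs, k):
    """Total variation over the k most likely tokens under the old
    policy, with all remaining tokens merged into one bucket."""
    old_probs = np.asarray(old_probs, dtype=float)
    new_probs = np.asarray(new_probs, dtype=float)
    idx = np.argsort(old_probs)[-k:]  # indices of the k largest old probabilities
    old_k, new_k = old_probs[idx], new_probs[idx]
    old_rest = 1.0 - old_k.sum()
    new_rest = 1.0 - new_k.sum()
    return 0.5 * (np.abs(old_k - new_k).sum() + abs(old_rest - new_rest))


def full_tv(old_probs, new_probs):
    """Exact total variation over the whole vocabulary."""
    diff = np.asarray(old_probs, dtype=float) - np.asarray(new_probs, dtype=float)
    return 0.5 * np.abs(diff).sum()


# Low-probability token: the ratio triples, yet the distribution barely moves.
low = (0.001, 0.003)
# High-probability token: the ratio stays inside the clip range,
# yet a large amount of probability mass moves.
high = (0.99, 0.80)

for name, (p_old, p_new) in [("low", low), ("high", high)]:
    print(name,
          "ratio clipped:", ratio_clip_active(p_old, p_new),
          "binary TV:", round(binary_tv(p_old, p_new), 4))
```

Under these assumptions, the low-probability update trips the ratio test while moving only about 0.002 of probability mass, and the high-probability update passes the ratio test while moving about 0.19. A divergence threshold would treat the two cases the other way around, which is the behavior the abstract argues for.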

3 Citations

0 Influential

10.5 Altmetric

55.5 Score


