2601.03320v1 Jan 06, 2026 cs.LG

Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

Shuo Han (citations: 11, h-index: 2)
Yu-Mei Luo (citations: 67, h-index: 2)
Yihan Hu (citations: 56, h-index: 4)
Dong Li (citations: 175, h-index: 5)
Jianye Hao (citations: 184, h-index: 5)

Abstract

On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping stabilizes training, this heuristic hard constraint incurs a fundamental cost: it indiscriminately truncates gradients from high-return yet high-divergence actions, suppressing rare but highly informative "eureka moments" in complex reasoning. Moreover, once data becomes slightly stale, hard clipping renders it unusable, leading to severe sample inefficiency. In this work, we revisit the trust-region objective in policy optimization and show that explicitly constraining the variance (second central moment) of the policy ratio provides a principled and smooth relaxation of hard clipping. This distributional constraint stabilizes policy updates while preserving gradient signals from valuable trajectories. Building on this insight, we propose $R^2VPO$ (Ratio-Variance Regularized Policy Optimization), a novel primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse by dynamically reweighting stale samples rather than discarding them. We extensively evaluate $R^2VPO$ on fine-tuning state-of-the-art LLMs, including DeepSeek-Distill-Qwen-1.5B and the openPangu-Embedded series (1B and 7B), across challenging mathematical reasoning benchmarks. Experimental results show that $R^2VPO$ consistently achieves superior asymptotic performance, with average relative gains of up to 17% over strong clipping-based baselines, while requiring approximately 50% fewer rollouts to reach convergence. These findings establish ratio-variance control as a promising direction for improving both stability and data efficiency in RL-based LLM alignment.
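
The contrast the abstract draws between hard clipping and a ratio-variance constraint can be illustrated with a small sketch. The abstract does not give the paper's exact objective, Lagrangian, or update rule, so everything below is an assumption made for illustration: the function names, the penalty form (surrogate minus `lam` times the empirical variance of the ratio), the variance budget `delta`, and the projected dual-ascent step are hypothetical and are not taken from the paper.

```python
import math

def policy_ratios(logp_new, logp_old):
    """Importance ratios r_i = pi_new(a_i) / pi_old(a_i) from per-sample log-probs."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def ratio_variance(ratios):
    """Empirical second central moment of the ratios."""
    mean = sum(ratios) / len(ratios)
    return sum((r - mean) ** 2 for r in ratios) / len(ratios)

def clipped_surrogate(ratios, advantages, eps=0.2):
    """PPO-style clipped objective (to be maximized), shown for comparison."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Once r leaves [1 - eps, 1 + eps] in the favorable direction,
        # this term no longer depends on r, so its gradient vanishes.
        total += min(r * a, clipped * a)
    return total / len(ratios)

def variance_regularized_surrogate(ratios, advantages, lam):
    """Assumed soft alternative: unclipped surrogate minus lam * Var(ratio)."""
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return surrogate - lam * ratio_variance(ratios)

def dual_update(lam, ratios, delta, step_size):
    """Assumed dual step: raise lam when Var(ratio) exceeds the budget delta,
    lower it otherwise, and keep it non-negative."""
    return max(0.0, lam + step_size * (ratio_variance(ratios) - delta))

# Hypothetical batch: the third sample has a large ratio and a large advantage.
logp_old = [-1.0, -2.0, -3.0, -1.5]
logp_new = [-0.9, -2.1, -2.0, -1.5]
advantages = [0.5, -0.5, 2.0, 0.0]

ratios = policy_ratios(logp_new, logp_old)
lam = dual_update(lam=0.1, ratios=ratios, delta=0.05, step_size=0.5)
print("clipped objective:", clipped_surrogate(ratios, advantages))
print("variance-regularized objective:",
      variance_regularized_surrogate(ratios, advantages, lam))
print("updated multiplier:", lam)
```

In this sketch the clipped objective caps the contribution of the third sample at the clip boundary, whereas the penalized objective keeps that sample's full contribution and discourages large ratio spread only through the batch-level variance term, whose weight `lam` adapts to how far the variance is from the budget.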

Citations: 4, Influential: 0, Altmetric: 2.5, Score: 16.5
