2605.07331v1 May 08, 2026 cs.LG

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Yuheng Zhang (Citations: 186, h-index: 6)

Shuowei Jin (Citations: 177, h-index: 7)

Chen Ye (Citations: 813, h-index: 11)

Changlong Yu (Citations: 240, h-index: 7)

Wei Xiong (Citations: 74, h-index: 2)

Saurabh Sahu (Citations: 66, h-index: 2)

Nan Jiang (Citations: 1,092, h-index: 11)

Abstract

Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position $t$, as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural $\sqrt{t}$ growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon-llm/CTPO.
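
To make the ratios contrasted in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. Writing $r_s = \pi_\theta(y_s \mid x, y_{<s}) / \pi_{\text{old}}(y_s \mid x, y_{<s})$ for the per-token ratio, the cumulative token ratio at position $t$ is $\prod_{s \le t} r_s$. The log-probability values, the helper names, the clip scale `eps`, and the symmetric bound of `eps * sqrt(t)` are all assumptions chosen for illustration; the abstract states only that the log-space clip bounds scale with $\sqrt{t}$, and it does not say how the clipped ratio enters the training objective.

```python
import math


def token_log_ratios(logp_new, logp_old):
    """Per-token log IS ratios: log pi_new(y_t | prefix) - log pi_old(y_t | prefix)."""
    return [n - o for n, o in zip(logp_new, logp_old)]


def cumulative_log_ratios(log_ratios):
    """Running sum of per-token log-ratios.

    Entry t-1 is the log of the product of per-token ratios up to position t.
    """
    out, total = [], 0.0
    for lr in log_ratios:
        total += lr
        out.append(total)
    return out


def position_adaptive_clip(cum_log_ratios, eps=0.2):
    """Clip the cumulative log-ratio at position t to [-eps*sqrt(t), +eps*sqrt(t)].

    ASSUMPTION: the symmetric bound and the value of eps are illustrative only.
    """
    clipped = []
    for t, c in enumerate(cum_log_ratios, start=1):
        bound = eps * math.sqrt(t)
        clipped.append(max(-bound, min(bound, c)))
    return clipped


# Hypothetical per-token log-probabilities of one sampled response.
logp_old = [-1.0, -2.0, -0.5, -1.5]  # under the behavior (old) policy
logp_new = [-0.9, -1.7, -0.6, -0.9]  # under the current policy

log_r = token_log_ratios(logp_new, logp_old)
cum = cumulative_log_ratios(log_r)
clipped = position_adaptive_clip(cum, eps=0.2)

# The ratio variants discussed in the abstract, computed for comparison.
token_ratio = [math.exp(x) for x in log_r]    # token-level (PPO/GRPO style)
cumulative_ratio = [math.exp(x) for x in cum]  # cumulative token ratio
sequence_ratio = math.exp(cum[-1])             # full sequence ratio
length_normalized_ratio = math.exp(cum[-1] / len(cum))  # GSPO-style normalization

for t, (r_tok, r_cum, c) in enumerate(zip(token_ratio, cumulative_ratio, clipped), 1):
    print(f"t={t}  token={r_tok:.3f}  cumulative={r_cum:.3f}  clipped={math.exp(c):.3f}")
print(f"sequence={sequence_ratio:.3f}  length-normalized={length_normalized_ratio:.3f}")
```

At the final position the cumulative ratio coincides with the full sequence ratio, while every earlier position multiplies fewer per-token terms. That is the intuition behind the abstract's claim of lower variance than the full sequence ratio.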

Citations: 0 · Influential: 0 · Altmetric: 25.5 · Score: 127.5
