2602.07078v1 Feb 06, 2026 cs.LG

The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Jiawei Xu (Citations: 178, h-index: 5)
Longtao Zheng (Citations: 775, h-index: 10)
Yingru Li (Citations: 531, h-index: 6)
Jiacai Liu (Citations: 50, h-index: 4)
Wei Liu (Citations: 34, h-index: 3)
Yuxuan Tong (Citations: 7, h-index: 2)
Yaxiang Zhang (Citations: 11, h-index: 2)
Tianle Cai (Citations: 132, h-index: 4)
Ge Zhang (Citations: 206, h-index: 3)
Qian Liu (Citations: 76, h-index: 4)
Baoxiang Wang (Citations: 15, h-index: 3)
Zhenghai Xue (Citations: 421, h-index: 4)
Ziniu Li (Citations: 325, h-index: 9)

Abstract

Reinforcement Learning (RL) for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. To mitigate this, a baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the performance of large group sizes ($N=32$) with only $N=4$, reducing token consumption by over 65% across single-turn and tool-integrated reasoning tasks.
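The abstract names two ingredients without giving formulas: a baseline tied to cumulative gradient norms, and a proxy for those norms computed from forward-pass probabilities. The sketch below is only an illustration of how such pieces could fit together, not the paper's method. The one standard fact it uses is that for a softmax policy the gradient of log p(y) with respect to the logits is onehot(y) - p, whose squared norm is 1 - 2 p_y + sum_v p_v^2. Everything else is an assumption made for illustration: the function names, the use of a cumulative sum over tokens as the weight, the norm-weighted average of rewards (the form taken by the classic optimal baseline), and equal-length sequences with no padding.

```python
import numpy as np

def logit_grad_sq_norm(probs, token_ids):
    """Squared norm of d log p(y_t) / d logits_t for each position t.

    probs: (T, V) softmax probabilities from the forward pass.
    token_ids: (T,) sampled token indices.
    Uses ||onehot(y) - p||^2 = 1 - 2 p_y + sum_v p_v^2.
    """
    p_y = probs[np.arange(len(token_ids)), token_ids]
    return 1.0 - 2.0 * p_y + np.sum(probs ** 2, axis=-1)

def weighted_token_baseline(rewards, weights, eps=1e-8):
    """Per-token baseline as a weight-averaged reward over a group.

    rewards: (N,) one scalar reward per sampled sequence.
    weights: (N, T) non-negative weights (here: cumulative proxy norms).
    Returns b of shape (T,), b_t = sum_i w[i,t] R[i] / sum_i w[i,t].
    ASSUMPTION: this weighting is a hypothetical stand-in, not the paper's formula.
    """
    num = (weights * rewards[:, None]).sum(axis=0)
    den = weights.sum(axis=0)
    return num / np.maximum(den, eps)

# Toy group: N=4 sequences, T=3 tokens, V=5 vocabulary entries.
rng = np.random.default_rng(0)
N, T, V = 4, 3, 5
logits = rng.normal(size=(N, T, V))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
tokens = rng.integers(0, V, size=(N, T))
rewards = np.array([1.0, 0.0, 1.0, 0.0])

sq_norms = np.stack([logit_grad_sq_norm(probs[i], tokens[i]) for i in range(N)])
cum_weights = np.cumsum(sq_norms, axis=1)        # assumed "cumulative" weighting
baseline = weighted_token_baseline(rewards, cum_weights)
advantages = rewards[:, None] - baseline[None, :]  # (N, T) token-level advantages
```

With equal weights this baseline reduces to the plain group-mean reward, so the sketch differs from a standard group baseline only where the proxy norms vary across sequences and positions.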
