2602.01826v1 Feb 02, 2026 cs.LG

Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Jiawei Xu

Citations: 178

h-index: 5

Yingru Li

Citations: 531

h-index: 6

Jiacai Liu

Citations: 50

h-index: 4

Yaxiang Zhang

Citations: 11

h-index: 2

Qian Liu

Citations: 76

h-index: 4

Haoyuan Li

Citations: 53

h-index: 3

Ziniu Li

Citations: 325

h-index: 9

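Reinforcement learning (RL) for training large language models is notoriously unstable. Recent work attributes this instability to "training-inference mismatch" arising from inconsistent hybrid engines, yet common remedies such as importance sampling can fail over long training runs. This work analyzes the instability from an optimization perspective and shows that gradient noise and training-inference mismatch grow together as training progresses. It also finds that shrinking the update size effectively suppresses the mismatch. Taken together, these observations suggest that the mismatch is not merely a static numerical discrepancy but a dynamic failure coupled with the model's optimization. Building on this insight, the authors propose a simple yet effective fix: a specialized learning rate (LR) scheduler. Whereas conventional LR schedulers follow a predefined decay schedule, this method triggers LR decay dynamically based on response length, which serves as a reliable early-warning signal of impending instability. Experiments indicate that reducing the learning rate as gradient noise rises consistently stabilizes RL training and keeps the training-inference mismatch at a safe level.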
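For readers unfamiliar with the importance-sampling remedy mentioned above, the following is a minimal sketch of one common form: per-token importance weights between the training engine and the inference engine, truncated at a cap. The function names, the `cap` value, and the mean-absolute-gap metric are illustrative assumptions, not taken from the paper, which may define and measure the mismatch differently.

```python
import math


def truncated_is_weights(train_logprobs, infer_logprobs, cap=2.0):
    """Per-token weights pi_train / pi_infer, clipped from above at `cap`.

    Both arguments are log-probabilities of the same sampled tokens, one
    list from the training engine and one from the inference engine.
    `cap` is a hypothetical truncation threshold chosen for illustration.
    """
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-probability lists must have equal length")
    return [
        min(math.exp(t - i), cap)
        for t, i in zip(train_logprobs, infer_logprobs)
    ]


def mean_abs_logprob_gap(train_logprobs, infer_logprobs):
    """A simple (assumed) mismatch measure: mean |log pi_train - log pi_infer|."""
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-probability lists must have equal length")
    if not train_logprobs:
        raise ValueError("at least one token is required")
    gaps = [abs(t - i) for t, i in zip(train_logprobs, infer_logprobs)]
    return sum(gaps) / len(gaps)
```

A weight of 1 means both engines agree on a token. The cap bounds the variance that a few badly mismatched tokens would otherwise add to the gradient estimate, at the cost of some bias.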

Original Abstract

Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of the pre-defined decay schedule used by traditional LR schedulers, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal for impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
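The abstract does not spell out the trigger rule, so the following is only a rough sketch of what a response-length-triggered LR scheduler could look like. It assumes that a rise in mean response length above a fixed threshold is the warning signal. The class name `LengthTriggeredLR` and the parameters `length_threshold`, `decay_factor`, `window`, and `min_lr` are all hypothetical and do not come from the paper.

```python
from collections import deque


class LengthTriggeredLR:
    """Decay the learning rate when recent mean response length gets large.

    Assumption (not from the paper): the trigger fires when the moving
    average of per-step mean response length exceeds `length_threshold`.
    """

    def __init__(self, base_lr, length_threshold, decay_factor=0.5,
                 window=3, min_lr=0.0):
        if not 0.0 < decay_factor < 1.0:
            raise ValueError("decay_factor must be in (0, 1)")
        if window < 1:
            raise ValueError("window must be at least 1")
        self.lr = base_lr
        self.length_threshold = length_threshold
        self.decay_factor = decay_factor
        self.min_lr = min_lr
        self._recent = deque(maxlen=window)

    def step(self, mean_response_length):
        """Record this step's mean response length and return the LR to use."""
        self._recent.append(mean_response_length)
        window_full = len(self._recent) == self._recent.maxlen
        if window_full:
            average = sum(self._recent) / len(self._recent)
            if average > self.length_threshold:
                self.lr = max(self.lr * self.decay_factor, self.min_lr)
                # Require a fresh window before the next decay can fire.
                self._recent.clear()
        return self.lr


# Usage: feed the scheduler one length statistic per training step.
scheduler = LengthTriggeredLR(base_lr=1.0, length_threshold=100, window=2)
lrs = [scheduler.step(length) for length in (50, 80, 200, 300, 300)]
```

Unlike a cosine or step schedule, nothing here depends on the step count: the LR stays at its base value until the monitored signal crosses the threshold. Clearing the window after each decay is one simple way to avoid repeated decays from a single spike.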

7 Citations

0 Influential

4.5 Altmetric

29.5 Score
