2603.22117v1 Mar 23, 2026 cs.LG

On the Update Direction of Reinforcement Learning with Verifiable Rewards (RLVR) for LLM Reasoning: Identification and Exploitation

On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Kexin Huang

Citations: 71

h-index: 5

Haoming Meng

Citations: 52

h-index: 5

Jinda Lu

Citations: 59

h-index: 5

Chiyu Ma

Citations: 420

h-index: 8

Ziqian Chen

Citations: 87

h-index: 5

Xue Wang

Citations: 79

h-index: 5

Bolin Ding

Citations: 88

h-index: 5

Jiancan Wu

Citations: 3,690

h-index: 25

Xiang Wang

Citations: 956

h-index: 15

Xiangnan He

Citations: 1,743

h-index: 20

Guoyin Wang

Citations: 46

h-index: 4

Jingren Zhou

Citations: 1,010

h-index: 14

Junkang Wu

Citations: 563

h-index: 11

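Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs). Prior work has noted that the changes induced by RLVR are sparse, but it has focused mainly on the magnitude of those changes and has largely overlooked their direction. In this work, we argue that the update direction is the more important factor for understanding the effects of RLVR, and that it can be captured by the signed, token-level log-probability difference ($\Delta \log p$) between the base model and the final RLVR model. Through statistical analysis and token-replacement experiments, we show that $\Delta \log p$ identifies sparse but reasoning-critical updates more effectively than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned $\Delta \log p$ direction, improving reasoning accuracy without additional training; and (2) a training-time reweighting method that concentrates learning on low-probability tokens (which correspond to higher $\Delta \log p$), improving reasoning performance across a range of models and benchmarks. This work establishes the direction of change as a key principle for analyzing and improving RLVR.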

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta \log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta \log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned $\Delta \log p$ direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher $\Delta \log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
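
The abstract names three quantities without giving formulas: the signed token-level difference $\Delta \log p$, test-time extrapolation along that direction, and training-time reweighting toward low-probability tokens. The NumPy sketch below shows one plausible reading of each and should not be taken as the authors' implementation. Only the definition of $\Delta \log p$ as a log-probability difference between the RLVR and base models comes from the abstract. The extrapolation rule (adding `alpha` times the difference and renormalizing), the weight form `(1 - p) ** gamma`, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def delta_log_p(base_logits, rlvr_logits, token_ids):
    """Signed per-token difference: log p_rlvr(y_t) - log p_base(y_t).

    base_logits, rlvr_logits: arrays of shape (seq_len, vocab_size)
    token_ids: integer array of shape (seq_len,) with the realized tokens
    """
    base_lp = log_softmax(base_logits)
    rlvr_lp = log_softmax(rlvr_logits)
    positions = np.arange(len(token_ids))
    return rlvr_lp[positions, token_ids] - base_lp[positions, token_ids]

def extrapolate(base_logits, rlvr_logits, alpha=0.5):
    """Assumed extrapolation rule (not taken from the paper):
    log p_ext is proportional to log p_rlvr + alpha * (log p_rlvr - log p_base).
    alpha = 0 recovers the RLVR policy; alpha > 0 amplifies its shift.
    """
    base_lp = log_softmax(base_logits)
    rlvr_lp = log_softmax(rlvr_logits)
    return log_softmax(rlvr_lp + alpha * (rlvr_lp - base_lp))

def low_prob_weights(token_logp, gamma=1.0, eps=1e-12):
    """Hypothetical per-token loss weights that emphasize low-probability
    tokens: w_t = (1 - p_t) ** gamma, rescaled to have mean 1."""
    w = np.clip(1.0 - np.exp(token_logp), 0.0, 1.0) ** gamma
    return w / max(w.mean(), eps)

# Toy usage: 2 positions, vocabulary of 3 tokens.
base = np.zeros((2, 3))
rlvr = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
tokens = np.array([0, 1])
print(delta_log_p(base, rlvr, tokens))          # positive where RLVR raised the token
print(np.exp(extrapolate(base, rlvr, alpha=1.0)))
print(low_prob_weights(log_softmax(rlvr)[np.arange(2), tokens]))
```

In this toy setup the first position is the only one where the two models disagree, so it is the only one with a nonzero $\Delta \log p$, and extrapolation sharpens the distribution there while leaving the second position unchanged.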

8 Citations

1 Influential

12.5 Altmetric

72.5 Score

