2602.14872v1 Feb 16, 2026 cs.LG

On the Learning Dynamics of RLVR at the Edge of Competence

Yingbin Liang (Citations: 195, h-index: 5)
Yu Huang (Citations: 85, h-index: 4)
Zixin Wen (Citations: 121, h-index: 4)
Yuejie Chi (Citations: 304, h-index: 7)
Aarti Singh (Citations: 126, h-index: 5)
Yuxin Chen (Citations: 323, h-index: 8)
Yuting Wei (Citations: 78, h-index: 6)

Abstract

Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress recurs. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model's capabilities to the point where harder ones become tractable, resulting in steady and continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.
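The contrast between plateaus and the relay effect can be illustrated with a small toy model. The sketch below is not the paper's transformer setting or its compositional task. It assumes, purely for illustration, that a level-k problem is solved only if k independent "skills" all succeed, that each skill succeeds with probability sigmoid(theta_j), and that training follows the exact gradient of the expected outcome reward. The function name `steps_to_solve_hardest` and all parameter values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def steps_to_solve_hardest(levels, num_skills=6, theta0=-2.0, lr=1.0,
                           target=0.5, max_steps=200_000):
    """Count gradient-ascent steps until the hardest level is solved
    with probability >= target. Toy assumption: a level-k problem
    succeeds only if skills 1..k all succeed, so the expected outcome
    reward is J_k = p_1 * ... * p_k with p_j = sigmoid(theta_j).
    """
    theta = [theta0] * num_skills
    weight = 1.0 / len(levels)  # uniform data mixture over the given levels
    for step in range(max_steps):
        p = [sigmoid(t) for t in theta]
        # Prefix products: J[k - 1] is the success probability at level k.
        J = []
        prod = 1.0
        for pj in p:
            prod *= pj
            J.append(prod)
        if J[-1] >= target:
            return step
        # Exact gradient of the mixture reward:
        # dJ_k / dtheta_j = J_k * (1 - p_j) for every j <= k.
        grad = [0.0] * num_skills
        for k in levels:
            for j in range(k):
                grad[j] += weight * J[k - 1] * (1.0 - p[j])
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return None  # hardest level not solved within max_steps


# Smooth difficulty spectrum: every level from 1 to 6 appears in the data.
smooth_steps = steps_to_solve_hardest(levels=[1, 2, 3, 4, 5, 6])
# Abrupt gap: only the easiest and the hardest level appear.
gap_steps = steps_to_solve_hardest(levels=[1, 6])
print("smooth mixture:", smooth_steps, "steps")
print("gapped mixture:", gap_steps, "steps")
```

In this toy, the gradient on each skill is proportional to the success probability of the levels that use it. With the smooth mixture, each solved level raises the reward signal of the next one. With the gapped mixture, the hardest level starts with a near-zero success probability, so its skills receive almost no signal for a long stretch before improving quickly.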

