2603.07197v1 Mar 07, 2026 cs.AI

Re²: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

Min Zhang (Citations: 473; h-index: 11)

Dong Li (Citations: 175; h-index: 5)

Jianye Hao (Citations: 184; h-index: 5)

Juntao Li (Citations: 2,739; h-index: 24)

Pinzheng Wang, Soochow University (Citations: 100; h-index: 6)

Shulin Xu (Citations: 69; h-index: 3)

Yuxi-ang Luo (Citations: 45; h-index: 3)

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized. To this end, we introduce Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5% to over 30%. This leads to substantial performance gains over standard RLVR under the same training compute budget, and also demonstrates notable improvements in test-time performance as the number of samples increases.
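The abstract describes models that may abandon an unproductive reasoning path and restart rather than always committing to a final answer. The following is a minimal sketch of that control flow at inference time only, not the paper's training procedure or reward design, which this page does not specify. The names `solve_with_resolving`, `toy_attempt`, and the `REDO` sentinel are hypothetical, and the 0.3 redo probability is a toy value that merely echoes the "over 30%" figure above.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical sentinel: an attempt that ends by asking to re-solve
# instead of committing to a final answer.
REDO = None


def solve_with_resolving(
    attempt: Callable[[random.Random], Optional[str]],
    max_attempts: int,
    rng: random.Random,
) -> Tuple[Optional[str], int]:
    """Run attempts until one commits to an answer or the budget runs out.

    Returns (answer, attempts_used). The answer is None when every
    attempt within the budget chose to restart.
    """
    for used in range(1, max_attempts + 1):
        answer = attempt(rng)
        if answer is not REDO:
            return answer, used
    return None, max_attempts


def toy_attempt(rng: random.Random) -> Optional[str]:
    """Stand-in for one chain of thought (an assumption, not a real model).

    With probability 0.3 the attempt abandons its path and requests a
    restart; otherwise it commits to a fixed placeholder answer.
    """
    if rng.random() < 0.3:
        return REDO
    return "42"


if __name__ == "__main__":
    answer, used = solve_with_resolving(toy_attempt, max_attempts=8, rng=random.Random(0))
    print(answer, used)
```

In this sketch, a restart consumes one unit of the attempt budget, so the caller can bound total generation cost while still letting the policy discard a poorly started chain of thought.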

5 Citations

0 Influential

12 Altmetric

65.0 Score
