2604.01591v1 Apr 02, 2026 cs.AI

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Zhenwei Tang

Ashton Anderson

Difan Jiao

Qianfeng Wen

Blair Yang

Abstract

We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
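The two-phase schedule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exact-match rule in `binary_reward`, the wording of `refinement_prompt`, the choice of which draft to refine, and the `sample` callable standing in for the policy are all assumptions made for the example. The policy-gradient update itself is omitted; only the reward and group-relative advantage computation is shown.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def binary_reward(answer: str, reference: str) -> float:
    """Return 1.0 on an exact match with the reference, else 0.0 (assumed matching rule)."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def refinement_prompt(problem: str, draft: str) -> str:
    """Hypothetical template: shows the draft, but no verdict on it and no critique."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Previous solution:\n{draft}\n\n"
        "Review the solution and give a final answer."
    )


def think_twice_step_pair(
    problem: str,
    reference: str,
    sample: Callable[[str], str],
    group_size: int = 4,
) -> Tuple[List[float], List[float]]:
    """Compute advantages for one solve step and one refine step on the same problem."""
    # Phase 1: sample a group of solutions and score them with the binary reward.
    solutions = [sample(problem) for _ in range(group_size)]
    solve_adv = group_relative_advantages(
        [binary_reward(s, reference) for s in solutions]
    )

    # Phase 2: refine one of the model's own drafts (picking the first is an assumption)
    # and score the refinements with the same binary reward.
    prompt = refinement_prompt(problem, solutions[0])
    refinements = [sample(prompt) for _ in range(group_size)]
    refine_adv = group_relative_advantages(
        [binary_reward(r, reference) for r in refinements]
    )
    return solve_adv, refine_adv
```

In this sketch a group whose samples all receive the same reward yields zero advantages, so it contributes no gradient in either phase. A real training loop would feed both advantage lists into the policy update.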
