2601.12720v1 Jan 19, 2026 cs.AI

Teaching Large Reasoning Models Effective Reflection

Hanbin Wang (Citations: 3, h-index: 1)

Jingwei Song (Citations: 12, h-index: 1)

Jinpeng Li (Citations: 3, h-index: 1)

Fei Mi (Citations: 324, h-index: 11)

Lifeng Shang (Citations: 118, h-index: 5)

Qi Zhu (Citations: 278, h-index: 9)

Yasheng Wang (Citations: 134, h-index: 4)

Ganqu Cui (Citations: 13,687, h-index: 34)

Original Abstract

Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks, often by engaging in self-reflective behaviors such as self-critique and backtracking. However, not all reflections are beneficial: many are superficial, offering little to no improvement over the original answer while incurring computational overhead. In this paper, we identify and address the problem of superficial reflection in LRMs. We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self-generated critiques. SCFT prompts models to critique their own outputs, filters high-quality critiques through rejection sampling, and fine-tunes the model using a critique-based objective. Building on this foundation, we further introduce Reinforcement Learning with Effective Reflection Rewards (RLERR). RLERR leverages the high-quality reflections initialized by SCFT to construct reward signals, guiding the model to internalize the self-correction process via reinforcement learning. Experiments on two challenging benchmarks, AIME2024 and AIME2025, show that SCFT and RLERR significantly improve both reasoning accuracy and reflection quality, outperforming state-of-the-art baselines. All data and code are available at https://github.com/wanghanbinpanda/SCFT.
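
The critique-filtering step that the abstract attributes to SCFT can be illustrated with a minimal sketch. The abstract does not say how a critique is judged to be high quality, so the acceptance rule here (keep a critique only when its verdict on the solution agrees with a known correctness label) is an assumption, and the names `Critique`, `rejection_sample`, and `generate_critique` are hypothetical rather than taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    text: str             # the model's written critique of its own solution
    judged_correct: bool  # the critique's verdict: is the solution right?


def rejection_sample(
    problem: str,
    solution: str,
    solution_is_correct: bool,
    generate_critique: Callable[[str, str], Critique],
    num_samples: int = 8,
) -> List[Critique]:
    """Sample several critiques and keep those that pass the filter.

    Assumed filter (not confirmed by the source): a critique is kept
    only if its verdict matches the reference correctness label.
    """
    kept: List[Critique] = []
    for _ in range(num_samples):
        critique = generate_critique(problem, solution)
        if critique.judged_correct == solution_is_correct:
            kept.append(critique)
    return kept
```

In such a setup, `generate_critique` would wrap a call to the model being trained, and the kept critiques would form the fine-tuning set for the critique-based objective.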
