2601.01580v1 Jan 04, 2026 cs.LG

The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Xingcheng Xu (Citations: 27, h-index: 4)

Zibo Zhao (Citations: 1, h-index: 1)

Yuanting Zha (Citations: 0, h-index: 0)

Haipeng Zhang (Citations: 3, h-index: 1)

Original Abstract

Self-reflection capabilities emerge in Large Language Models after RL post-training, with multi-turn RL achieving substantial gains over SFT counterparts. Yet the mechanism by which a unified optimization objective gives rise to two functionally distinct capabilities, generating solutions and evaluating when to revise them, remains opaque. To address this question, we introduce the Gradient Attribution Property, which characterizes how reward gradients distribute across policy components. We formalize it through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into sampling ($\pi_{\text{sample}}$) for generation and decision ($\pi_{d}$) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution: length-weighting creates asymmetric regularization that constrains $\pi_{\text{sample}}$ while leaving $\pi_{d}$ under-optimized. This provides a theoretical explanation of why RL succeeds where SFT fails. Empirical validation of our theoretical predictions on arithmetic reasoning demonstrates that RL's superior generalization stems primarily from improved decision-making ($\pi_{d}$) rather than sampling capabilities, providing a first-principles mechanistic explanation for self-correction in thinking models.
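
The decomposition described in the abstract can be illustrated with a small worked sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two-turn trajectory $\tau = (y_1, d, y_2)$ for a prompt $x$, the binary decision $d \in \{\text{accept}, \text{revise}\}$, the scalar reward $R(\tau)$, and the choice to treat $d$ as a single token. The paper's formal definitions may differ. Under these assumptions, the trajectory log-probability factors as

$$
\log \pi_\theta(\tau \mid x) = \log \pi_{\text{sample}}(y_1 \mid x) + \log \pi_{d}(d \mid x, y_1) + \mathbb{1}[d = \text{revise}] \, \log \pi_{\text{sample}}(y_2 \mid x, y_1).
$$

Applying the standard score-function (REINFORCE) identity to the expected reward $J(\theta) = \mathbb{E}_{\tau}[R(\tau)]$ then gives

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\Big[ R(\tau) \big( \nabla_\theta \log \pi_{\text{sample}}(y_1 \mid x) + \mathbb{1}[d = \text{revise}] \, \nabla_\theta \log \pi_{\text{sample}}(y_2 \mid x, y_1) \big) \Big] + \mathbb{E}_{\tau}\Big[ R(\tau) \, \nabla_\theta \log \pi_{d}(d \mid x, y_1) \Big],
$$

so the same reward scales the sampling terms and the decision term. By contrast, one plausible reading of "length-weighting" is a token-averaged log-likelihood loss: on a trajectory with $|y_1| + |y_2|$ sampling tokens and one decision token, the decision term would receive a weight of only

$$
\frac{1}{|y_1| + |y_2| + 1},
$$

which shrinks as solutions get longer. For example, with $|y_1| = |y_2| = 100$ the decision token carries $1/201$ of the per-trajectory loss weight in this toy setting.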

0 Citations

0 Influential

2 Altmetric

10.0 Score
