2604.06628v1 Apr 08, 2026 cs.AI

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Yujia Liu (Citations: 0, h-index: 0)
Qihan Ren (Citations: 382, h-index: 7)
Shuai Shao (Citations: 71, h-index: 4)
Yuejin Xie (Citations: 48, h-index: 4)
Peng Wang (Citations: 29, h-index: 3)
Dadi Guo (Citations: 21, h-index: 3)
Jing Shao (Citations: 74, h-index: 3)
Xia Hu (Citations: 30, h-index: 3)
Dongrui Liu (Citations: 49, h-index: 3)
Yafu Li (Citations: 370, h-index: 8)
Quanshi Zhang (Citations: 23, h-index: 2)

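A common view of LLM post-training holds that supervised fine-tuning (SFT) memorizes while reinforcement learning (RL) generalizes. This work revisits that claim for reasoning SFT with long chain-of-thought (CoT) supervision and finds that generalization is not absent but conditional, jointly determined by optimization dynamics, training data, and base-model capability. Some reported failures are artifacts of under-optimization: cross-domain performance first degrades, then recovers and improves with further training (a dip-and-recovery pattern), so checkpoints from short training runs can underestimate generalization. Both the quality and the structure of the data matter: low-quality solutions broadly harm generalization, whereas verified long-CoT traces yield consistent cross-domain gains. Model capability is also essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a simple arithmetic game, whereas weaker models imitate surface-level expression. This generalization is asymmetric, however: reasoning improves while safety degrades, so the question should shift from whether reasoning SFT generalizes to under what conditions, and at what cost, it does.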

Original Abstract

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.

3 Citations

0 Influential

4 Altmetric

23.0 Score
