2601.09233v1 Jan 14, 2026 cs.LG

GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization

Peng Pei (Citations: 152, h-index: 8)
Lexiang Tang (Citations: 75, h-index: 3)
Lu Ma (Citations: 112, h-index: 4)
Zhengyang Zhao (Citations: 41, h-index: 4)
Yizhen Jiang (Citations: 27, h-index: 2)
Xiaochen Ma (Citations: 81, h-index: 3)
Zimo Meng (Citations: 29, h-index: 3)
Chengyu Shen (Citations: 141, h-index: 5)
Haoze Sun (Citations: 22, h-index: 1)
Wentao Zhang (Citations: 6, h-index: 2)

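The post-training paradigm for Large Reasoning Models (LRMs), Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), suffers from an intrinsic optimization mismatch. The rigid supervision inherent in SFT induces distributional collapse, which exhausts the exploration space that the subsequent RL stage needs. This paper reformulates SFT within a unified post-training framework and proposes Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as an extreme zero-temperature limit that suppresses the base prior. In contrast, GIFT incorporates supervision as a finite-temperature energy potential, building a bridge between distributions that keeps the objective consistent across the entire post-training pipeline. Our experiments show that, when used to initialize RL, GIFT substantially outperforms standard SFT and other competitive baselines, and that it offers a mathematically principled path toward global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.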

Original Abstract

The prevailing post-training paradigm for Large Reasoning Models (LRMs)--Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)--suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.
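
The contrast between a zero-temperature limit and a finite-temperature target can be illustrated with a toy example. The sketch below is not taken from the paper or its repository: it assumes a target of the form q_T(y) proportional to p_base(y) * exp(-E(y) / T), with a hypothetical energy E that is 0 for the demonstrated token and 1 for every other token. The names `gibbs_target`, `entropy`, `base`, and `energy` are all illustrative. Under these assumptions, a very small T collapses the target onto the demonstrated token, while T = 1 keeps some mass on the alternatives and preserves their relative ordering under the base prior.

```python
import math


def gibbs_target(base_probs, energies, temperature):
    """Return q(y) proportional to base_probs[y] * exp(-energies[y] / temperature).

    Assumed form for illustration only; not the paper's stated objective.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) - e / temperature for p, e in zip(base_probs, energies)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical base prior over four tokens; token 1 is the demonstrated one.
base = [0.5, 0.3, 0.15, 0.05]
energy = [1.0, 0.0, 1.0, 1.0]

cold = gibbs_target(base, energy, temperature=0.01)  # near the zero-temperature limit
warm = gibbs_target(base, energy, temperature=1.0)   # finite temperature

print("cold:", [round(p, 4) for p in cold], "entropy:", entropy(cold))
print("warm:", [round(p, 4) for p in warm], "entropy:", entropy(warm))
```

In this toy setting the cold target is effectively one-hot, which mirrors the distributional collapse the abstract attributes to standard SFT, whereas the warm target still ranks the non-demonstrated tokens as the base prior does.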
