2605.05040v1 May 06, 2026 cs.LG

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

Lingzhou Xue

Citations: 83

h-index: 5

Xin Yu

Citations: 17

h-index: 3

Liucheng Liao

Citations: 48

h-index: 2

Yiwen Zhang

Citations: 14

h-index: 2

Yin Yu

Citations: 12

h-index: 2

Qin Guo

Citations: 40

h-index: 4

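On-policy distillation is an efficient alternative to reinforcement learning that provides dense, token-level training signals. Because it depends on a strong external teacher model, however, recent work has turned to on-policy self-distillation, in which the same model acts as both teacher and student under different prompt contexts. Existing self-distillation methods mostly train by KL matching against a teacher whose prompt has been augmented with additional context, which often causes training instability and can degrade reasoning performance over time. Self-distillation from the same model via prompt augmentation also lacks the exploratory diversity that a genuine external teacher provides. To overcome these limitations, we go beyond KL matching to a fixed teacher and propose Preference-Based Self-Distillation (PBSD), which revisits on-policy self-distillation through reward regularization. Rather than matching the teacher distribution directly, we derive a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, and we prove that the resulting target policy is superior to the original teacher under this objective. In practice, PBSD optimizes the preference gap between teacher and student samples while keeping student sampling on-policy. We support the framework with a statistical analysis of the induced preference-learning problem and formally characterize the conditions under which on-policy self-distillation is preferable to learning from an external teacher in our setting. In experiments on mathematical reasoning and tool-use benchmarks across multiple model scales, PBSD achieves the strongest average performance among comparable baselines, with better training stability than prior self-distillation methods and no loss of token efficiency.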

Original Abstract

On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and student under different prompt contexts. Yet existing self-distillation methods largely reduce learning to KL matching toward the context-augmented teacher model. This approach often suffers from training instability and can degrade reasoning performance over time. Moreover, self-distillation from the same model with prompt augmentation lacks the exploratory diversity provided by a genuine external teacher. To address these limitations, we move beyond fixed-teacher KL matching and propose Preference-Based Self-Distillation (PBSD), which revisits on-policy self-distillation through a reward-regularized perspective. Instead of directly matching the teacher distribution, we derive a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, yielding a target policy provably superior to the original teacher under this objective. Practically, PBSD optimizes preference gaps between teacher and student samples while maintaining on-policy student sampling. We support this framework with a statistical analysis of the induced preference-learning problem, formally establishing when on-policy self-distillation is preferable to learning from an external teacher in our setting. Experiments on mathematical reasoning and tool-use benchmarks across multiple model scales demonstrate that PBSD consistently achieves the strongest average performance among comparable baselines, showing improved training stability over prior self-distillation baselines while preserving token efficiency.
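
The abstract does not state the objective itself, so the following is only a minimal sketch under an assumed form: the standard KL-regularized objective J(π) = E_π[r] − β·KL(π ‖ π_T), whose maximizer is π*(y) ∝ π_T(y)·exp(r(y)/β). The function names, the three-outcome toy distribution, the reward values, and β are all hypothetical and are not taken from the paper. The sketch illustrates why a reward-reweighted teacher scores at least as well as the original teacher under such an objective.

```python
import math

def reweighted_teacher(teacher_probs, rewards, beta):
    """Assumed closed form: pi*(y) proportional to pi_T(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(teacher_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, teacher_probs, rewards, beta):
    """Assumed objective: expected reward minus beta * KL(policy || teacher)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, teacher_probs)

# Hypothetical toy problem with three candidate responses.
teacher = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 1.0

target = reweighted_teacher(teacher, rewards, beta)
j_teacher = objective(teacher, teacher, rewards, beta)  # KL term is zero here
j_target = objective(target, teacher, rewards, beta)

# Under the assumed form, the optimum value equals beta * log of the normalizer.
log_partition = beta * math.log(
    sum(p * math.exp(r / beta) for p, r in zip(teacher, rewards))
)

print("target distribution:", target)
print("J(teacher) =", j_teacher)
print("J(target)  =", j_target)
```

In this toy setting the reweighted target shifts probability mass toward higher-reward responses while staying anchored to the teacher, and its objective value matches β·log Z, which is never below the teacher's own value E_{π_T}[r] by Jensen's inequality.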

2 Citations

0 Influential

2.5 Altmetric

14.5 Score
