2604.13016v2 Apr 14, 2026 cs.LG

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Bingxiang He

Yuxin Zuo

Wenkai Yang

Jinqian Zhang

Chaojun Xiao

Zhiyuan Liu

Cheng Qian

Huan Gao

Ning Ding

Yaxuan Li

Tianyu Yu

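On-policy distillation (OPD) has become a core technique in the post-training of large language models, but its training process remains poorly understood. This paper presents a systematic study of the dynamics and mechanisms of OPD. First, it identifies two conditions that determine whether OPD succeeds or fails: (i) the student and teacher models must share compatible thinking patterns, and (ii) even when the thinking patterns are consistent and the teacher scores higher, the teacher must provide genuinely new capabilities that the student did not encounter during training. The authors validate these claims through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. An analysis of the token-level mechanism shows that successful OPD is characterized by progressive alignment on high-probability tokens at the states the student visits, where a small shared token set holds most of the probability mass (97%-99%). The paper also proposes two practical strategies for recovering a failing OPD run: off-policy cold start and teacher-aligned prompt selection. Finally, it shows that the dense token-level reward OPD appears to offer for free actually comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.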

Original Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
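
The abstract's token-level observation, that a small token set shared by student and teacher holds most of the probability mass, can be made concrete with a small sketch. The abstract does not give the paper's exact definition of the shared set, so the version below assumes one plausible reading: intersect the student's and teacher's top-k tokens at a single student-visited state and sum each model's probability over that intersection. The function names, the choice of `k`, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Return the set of indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def shared_top_k_mass(student_logits, teacher_logits, k):
    """Probability mass each model places on the tokens in both top-k sets.

    Assumption: 'shared token set' is taken to mean the intersection of the
    two models' top-k tokens at one state; the paper may define it differently.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    shared = top_k_ids(p_student, k) & top_k_ids(p_teacher, k)
    student_mass = sum(p_student[i] for i in shared)
    teacher_mass = sum(p_teacher[i] for i in shared)
    return student_mass, teacher_mass, shared


# Toy next-token logits over a 5-token vocabulary at one student-visited state.
student_logits = [5.0, 4.0, 0.0, -1.0, -2.0]
teacher_logits = [4.5, 5.0, -1.0, 0.0, -2.0]
student_mass, teacher_mass, shared = shared_top_k_mass(
    student_logits, teacher_logits, k=2
)
print(sorted(shared), round(student_mass, 4), round(teacher_mass, 4))
```

In practice the logits would come from running both models on the same student-generated prefix, and the masses would be averaged over many states and tracked across training steps.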
