2601.10114v1 Jan 15, 2026 cs.AI

Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs

Chengang Feng (Citations: 34, h-index: 3)

Chaoliang Zhong (Citations: 160, h-index: 6)

Jun Sun (Citations: 6, h-index: 2)

Yusuke Oishi (Citations: 2, h-index: 1)

Abstract

Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we propose a novel theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher's convergence process during supervised fine-tuning (SFT) on the domain task, and a sample-wise Adaptive Weighting (AW) mechanism to preserve student strengths on SFS. Experiments across diverse domain tasks, including question answering (QA), named entity recognition (NER), and text classification in multiple languages, show that our method consistently outperforms existing distillation approaches, allowing the student model to match or even exceed the performance of its fine-tuned teacher.
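
The condition stated in the abstract (SFS advantage outweighing TFS deficit) can be illustrated with a small sketch. It assumes that the comparison is made on per-sample loss, that a sample belongs to the SFS when the student's loss is strictly lower than the teacher's, and that it belongs to the TFS otherwise. The function name `subdomain_gap` and the loss values are hypothetical and are not taken from the paper, whose formal definitions may differ.

```python
from typing import Sequence


def subdomain_gap(teacher_loss: Sequence[float], student_loss: Sequence[float]) -> dict:
    """Split per-sample loss differences into SFS advantage and TFS deficit.

    Assumption: lower loss is better, and a sample is "student-favored"
    when the student's loss is strictly below the teacher's.
    """
    if len(teacher_loss) != len(student_loss):
        raise ValueError("loss lists must have the same length")
    advantage = 0.0  # total loss the student saves on student-favored samples
    deficit = 0.0    # total extra loss the student pays on teacher-favored samples
    for t, s in zip(teacher_loss, student_loss):
        if s < t:
            advantage += t - s
        else:
            deficit += s - t
    return {
        "sfs_advantage": advantage,
        "tfs_deficit": deficit,
        # The student's total loss is below the teacher's exactly when
        # the advantage exceeds the deficit.
        "student_wins": advantage > deficit,
    }


# Hypothetical per-sample losses, made up for illustration only.
teacher = [0.25, 0.5, 1.0, 0.75]
student = [0.5, 0.25, 0.5, 1.0]
result = subdomain_gap(teacher, student)
print(result)
```

Under these assumptions the total loss gap is simply the deficit minus the advantage, so shrinking the deficit on teacher-favored samples and preserving the advantage on student-favored samples are the two levers available, which is how the abstract motivates SCD and AW respectively.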

Citations: 0, Influential: 0, Altmetric: 3, Score: 15.0
