2604.18963v1 Apr 21, 2026 cs.LG

Distillation Traps and Safeguards: A Tuning Knob for Controlling LLM Distillability

Distillation Traps and Guards: A Calibration Knob for LLM Distillability

Yongcheng Jing

Citations: 117

h-index: 6

Leszek Rutkowski

Citations: 9

h-index: 1

Dacheng Tao

Citations: 13

h-index: 2

Weixiao Zhan

Citations: 13

h-index: 2

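Knowledge distillation (KD) transfers the capabilities of large language models (LLMs) to smaller models, but it can fail unpredictably and can create model leakage risks. This study identifies several distillation traps that distort the training signal: tail noise, off-policy instability, and, most fundamentally, the gap between teacher and student models. These traps cause overconfident hallucinations, failed self-correction, and degraded local decoding, which make distillation fail. Building on these findings, the authors propose a post-hoc calibration method that, for the first time to their knowledge, controls a teacher's distillability through reinforcement fine-tuning (RFT). The objective combines task utility, a KL anchor, and a cross-tokenizer calibration reward. This turns distillability into a practical safety lever for foundation models, linking robust teacher-student transfer with deployment-aware model protection. In experiments on math, knowledge QA, and instruction-following tasks, students distilled from distillable calibrated teachers outperformed supervised fine-tuning (SFT) and KD baselines. Undistillable calibrated teachers, by contrast, retained their task performance but caused the students distilled from them to collapse, offering a practical knob for both better KD and model IP protection.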

Original Abstract

Knowledge distillation (KD) transfers capabilities from large language models (LLMs) to smaller students, yet it can fail unpredictably and also underpins model leakage risks. Our analysis revealed several distillation traps: tail noise, off-policy instability, and, most fundamentally, the teacher-student gap, that distort training signals. These traps manifest as overconfident hallucinations, self-correction collapse, and local decoding degradation, causing distillation to fail. Motivated by these findings, we propose a post-hoc calibration method that, to the best of our knowledge, for the first time enables control over a teacher's distillability via reinforcement fine-tuning (RFT). Our objective combines task utility, KL anchor, and across-tokenizer calibration reward. This makes distillability a practical safety lever for foundation models, connecting robust teacher-student transfer with deployment-aware model protection. Experiments across math, knowledge QA, and instruction-following tasks show that students distilled from distillable calibrated teachers outperform SFT and KD baselines, while undistillable calibrated teachers retain their task performance but cause distilled students to collapse, offering a practical knob for both better KD and model IP protection.
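
The abstract says the objective combines task utility, a KL anchor, and a calibration reward, but it gives no formula. The sketch below is one possible reading, not the authors' implementation. The additive form, the weights `beta` and `lam`, the sign flip for the undistillable setting, and the function names are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def combined_reward(task_utility, calibration_reward,
                    policy_probs, reference_probs,
                    beta=0.1, lam=1.0, distillable=True):
    """Hypothetical scalar reward for calibrating a teacher.

    task_utility       -- task score of the teacher's output (assumed scalar)
    calibration_reward -- score for how well a student can learn from the
                          output (assumed scalar; how it is computed across
                          tokenizers is not specified in the abstract)
    policy_probs       -- next-token distribution of the teacher being tuned
    reference_probs    -- same distribution from the frozen original teacher
    beta, lam          -- assumed weights, not taken from the paper
    distillable        -- assumed switch: reward (True) or penalize (False)
                          the calibration term
    """
    sign = 1.0 if distillable else -1.0
    # The KL anchor penalizes drift away from the original teacher.
    kl_anchor = kl_divergence(policy_probs, reference_probs)
    return task_utility + sign * lam * calibration_reward - beta * kl_anchor
```

Under these assumptions, the same routine covers both directions described in the abstract: `distillable=True` pushes the teacher toward outputs that students learn from well, and `distillable=False` pushes it away from them. In both cases the task term and the KL anchor keep the teacher's own performance close to the original.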

0 Citations

0 Influential

3 Altmetric

15.0 Score
