2604.20244v1 Apr 22, 2026 cs.CL

Hybrid Policy Distillation for LLMs

Ruobing Xie (Citations: 984, h-index: 11)

Pengfei Liu (Citations: 27, h-index: 2)

Wenhong Zhu (Citations: 105, h-index: 7)

Rui Wang (Citations: 98, h-index: 6)

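Knowledge distillation (KD) is a powerful method for compressing large language models (LLMs), but its effectiveness is determined by several factors, including divergence direction, optimization strategy, and data regime. This paper analyzes the design choices of existing KD methods, presents a unified view that shows how they relate to one another, and recasts KD as a reweighted log-likelihood optimization problem at the token level. It also proposes Hybrid Policy Distillation (HPD), which combines the complementary strengths of forward and reverse KL divergence to balance mode coverage and mode-seeking, and pairs off-policy data with lightweight, approximate on-policy sampling. HPD is validated on long-generation math reasoning and on short-generation dialogue and code tasks, showing improved optimization stability, computational efficiency, and final performance across a range of model families and sizes. The code for this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.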

Original Abstract

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
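The trade-off between mode coverage and mode-seeking mentioned in the abstract can be illustrated on a single token position with a toy vocabulary. The sketch below is not the HPD objective or the authors' code; the distributions, the function names, and the `alpha` interpolation in `mixed_kl` are all hypothetical and chosen only for illustration. It shows that forward KL, KL(teacher || student), favors a student that spreads mass over every teacher mode, while reverse KL, KL(student || teacher), favors a student that commits to one mode.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    # Expectation under the teacher: penalizes the student for missing
    # any token the teacher considers likely (mode coverage).
    return kl(teacher, student)


def reverse_kl(teacher, student):
    # Expectation under the student: penalizes the student for placing
    # mass where the teacher has little (mode-seeking).
    return kl(student, teacher)


def mixed_kl(teacher, student, alpha=0.5):
    # Hypothetical linear interpolation of the two directions.
    # This is an illustrative assumption, NOT the objective from the paper.
    return alpha * forward_kl(teacher, student) + (1 - alpha) * reverse_kl(teacher, student)


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
teacher = [0.495, 0.01, 0.495]    # two modes separated by an unlikely token
covering = [1 / 3, 1 / 3, 1 / 3]  # student that spreads mass everywhere
seeking = [0.98, 0.01, 0.01]      # student that commits to a single mode

if __name__ == "__main__":
    for name, student in [("covering", covering), ("seeking", seeking)]:
        print(
            name,
            "forward:", round(forward_kl(teacher, student), 4),
            "reverse:", round(reverse_kl(teacher, student), 4),
            "mixed:", round(mixed_kl(teacher, student), 4),
        )
```

Under these made-up numbers, the covering student scores lower on forward KL and the single-mode student scores lower on reverse KL, which is the tension a combined objective has to balance. The forward direction also differs from the teacher-weighted negative log-likelihood of the student only by a term that is constant in the student, which is one simple sense in which a KD loss can be read as a reweighted log-likelihood.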

1 Citation

0 Influential

28.9657359028 Altmetric

145.8 Score
