2602.22495v1 Feb 26, 2026 cs.LG

Improving LLM Reasoning Performance through Reinforcement-Learning-Based Knowledge Distillation

Reinforcement-aware Knowledge Distillation for LLM Reasoning

Yuting Zhang (Citations: 0, h-index: 0)
Shuli Jiang (Citations: 27, h-index: 2)
Dhananjay Ram (Citations: 90, h-index: 3)
Shuo Yang (Citations: 1, h-index: 1)
Zhuowen Tu (Citations: 76, h-index: 5)
Wei Xia (Citations: 43, h-index: 3)
S. Soatto (Citations: 1,492, h-index: 19)
Yantao Shen (Citations: 32, h-index: 2)
Zhaoyang Zhang (Citations: 24, h-index: 3)

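Reinforcement learning (RL) post-training has recently contributed substantially to the performance of large language models (LLMs) that carry out long reasoning chains, but the high inference cost of these models motivates distilling their knowledge into smaller models. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT) and rely on fixed teacher traces or on regularization based on the teacher-student Kullback-Leibler (KL) divergence. When combined with RL, these approaches often suffer from distribution mismatch and objective conflict: the teacher's supervision may not match the student's changing output distribution, and the KL regularizer may compete with reward maximization and therefore require careful loss balancing. To address these problems, we propose RL-aware distillation (RLAD). RLAD performs selective imitation during RL, guiding the student to follow the teacher only when doing so improves the current policy update. Its core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a mixture of the teacher and the old policy. This provides advantage-aware, trust-region-bounded distillation on the student's outputs and naturally balances exploration, exploitation, and imitation. Across a range of logical reasoning and math benchmarks, RLAD consistently outperforms existing offline distillation, standard GRPO, and KL-based on-policy teacher-student distillation.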

Original Abstract

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
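The abstract describes TRRD only in words, so the following is a rough sketch of what a PPO/GRPO-style clipped likelihood-ratio objective with a mixed anchor might look like, not the authors' implementation. Several things here are assumptions that do not come from the abstract: the anchor is taken to be a geometric (log-linear) mixture of the teacher and old-policy probabilities, the mixing weight `alpha` and clip range `eps` are hypothetical parameters, and all function names are illustrative. The paper may define the mixture, the clipping, and the advantage differently.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def anchor_logprob(logp_teacher, logp_old, alpha):
    """Log-probability of the anchor distribution.

    ASSUMPTION: a geometric mixture, i.e. a weighted sum of log-probs.
    The abstract only says "teacher--old-policy mixture".
    """
    return alpha * logp_teacher + (1.0 - alpha) * logp_old


def clipped_ratio_objective(logp_student, logp_teacher, logp_old,
                            advantage, alpha=0.5, eps=0.2):
    """PPO-style clipped surrogate with the ratio taken against the anchor.

    With alpha = 0 the anchor is the old policy, which recovers the usual
    PPO/GRPO ratio pi_theta / pi_old.
    """
    ratio = math.exp(logp_student - anchor_logprob(logp_teacher, logp_old, alpha))
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Pessimistic (minimum) of the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Four hypothetical rollouts for one prompt with binary rewards.
    advs = group_advantages([1.0, 0.0, 1.0, 0.0])
    # Hypothetical sequence log-probs under student, teacher, and old policy.
    value = clipped_ratio_objective(-1.0, -2.0, -1.0, advs[0], alpha=0.5)
    print(advs, value)
```

In this sketch, setting `alpha = 0` falls back to a plain GRPO update, while larger `alpha` moves the reference point of the ratio toward the teacher; the advantage sign and the clip range then decide how far a single update can move, which is one way to read the abstract's "advantage-aware, trust-region-bounded" description.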

