2606.06021v1 Jun 04, 2026 cs.LG

OPRD: On-Policy Representation Distillation

Haobo Wang
Shenzhi Yang
Guangcheng Zhu
Bowen Song
Xing Zheng
Yingfan Ma
Zhongqi Chen
Weiqiang Wang
Mingxuan Xia
Gang Chen

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44× faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
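
The contrast the abstract draws can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not state the alignment loss, the layer-selection rule, or how differing hidden sizes are handled, so the mean-squared-error loss, the equal hidden sizes, and the names `sampled_kl`, `hidden_mse`, and `selected` below are all assumptions made for illustration. The sketch compares a single-sample Monte Carlo estimate of KL(student || teacher), which changes from draw to draw, with a hidden-state alignment loss that is a deterministic function of the rollout.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_kl(p, q):
    """KL(p || q) computed over the full vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_kl(p, q, rng, n_samples=1):
    """Monte Carlo estimate of KL(p || q) from tokens sampled from p."""
    tokens = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(math.log(p[t] / q[t]) for t in tokens) / n_samples


def hidden_mse(student_layers, teacher_layers, selected):
    """Assumed loss: mean squared error over the selected layers.

    Both inputs are lists of per-layer hidden vectors with equal sizes
    (an assumption; a real student may need a projection).
    """
    total, count = 0.0, 0
    for layer in selected:
        for s, t in zip(student_layers[layer], teacher_layers[layer]):
            total += (s - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
vocab_size = 1000  # toy vocabulary, far smaller than ~150k

student_probs = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])
teacher_probs = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])

# Output space: the sampled estimate fluctuates around the exact value.
kl_exact = exact_kl(student_probs, teacher_probs)
kl_estimates = [sampled_kl(student_probs, teacher_probs, rng) for _ in range(200)]
kl_mean = sum(kl_estimates) / len(kl_estimates)
kl_var = sum((k - kl_mean) ** 2 for k in kl_estimates) / len(kl_estimates)

# Hidden-state space: toy states for one token position, 4 layers x 8 dims.
student_hidden = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
teacher_hidden = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
selected = [1, 3]  # hypothetical choice of layers to align

loss_a = hidden_mse(student_hidden, teacher_hidden, selected)
loss_b = hidden_mse(student_hidden, teacher_hidden, selected)

print("exact KL:", kl_exact)
print("variance of single-sample KL estimates:", kl_var)
print("hidden-state loss (identical on repeat):", loss_a, loss_b)
```

In this toy setting the single-sample KL estimates have nonzero variance, whereas the hidden-state loss involves no token sampling and returns the same value every time for a fixed rollout. It also never touches the vocabulary dimension, which is consistent with the memory saving the abstract reports, although the sketch itself does not measure that.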
