2606.06021v1 Jun 04, 2026 cs.LG

OPRD: On-Policy Representation Distillation

Haobo Wang

Citations: 1,249

h-index: 17

Shenzhi Yang

Citations: 127

h-index: 4

Guangcheng Zhu

Citations: 93

h-index: 3

Bowen Song

Citations: 42

h-index: 4

Xing Zheng

Citations: 62

h-index: 2

Yingfan Ma

Citations: 27

h-index: 3

Zhongqi Chen

Citations: 24

h-index: 3

Weiqiang Wang

Citations: 30

h-index: 4

Mingxuan Xia

Citations: 39

h-index: 4

Gang Chen

Citations: 1

h-index: 1

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) the teacher is treated as a black box: only its LM-head outputs are used, and all intermediate hidden states are discarded. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
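To make the contrast in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It places a single-sample Monte Carlo KL estimate (the kind of output-space signal the abstract says carries sampling variance) next to a deterministic hidden-state alignment loss. The function names, the mean-squared-error objective, the `layer_pairs` argument, and the optional linear `projections` for mismatched widths are all assumptions made for illustration; the abstract does not specify the actual OPRD loss or layer selection.

```python
import numpy as np


def sampled_kl_estimate(student_probs, teacher_probs, rng):
    """Single-sample Monte Carlo estimate of KL(student || teacher).

    student_probs, teacher_probs: arrays of shape (tokens, vocab) whose rows
    are next-token distributions. One token is sampled per position from the
    student, so the result changes with the random draw.
    """
    estimates = []
    for p_s, p_t in zip(student_probs, teacher_probs):
        token = rng.choice(len(p_s), p=p_s)  # sample from the student
        estimates.append(np.log(p_s[token]) - np.log(p_t[token]))
    return float(np.mean(estimates))


def representation_loss(student_hidden, teacher_hidden, layer_pairs,
                        projections=None):
    """Mean squared error between hidden states on the same rollout.

    student_hidden, teacher_hidden: lists of (tokens, dim) arrays, one per
    layer. layer_pairs: (student_layer, teacher_layer) index pairs to align
    (a hypothetical selection). projections: optional dict mapping a student
    layer index to a (d_student, d_teacher) matrix, used only when the two
    models have different widths (an assumption, not from the paper).
    """
    total = 0.0
    for s_idx, t_idx in layer_pairs:
        h_s = student_hidden[s_idx]
        h_t = teacher_hidden[t_idx]
        if projections is not None and s_idx in projections:
            h_s = h_s @ projections[s_idx]  # map student width to teacher width
        total += float(np.mean((h_s - h_t) ** 2))
    return total / len(layer_pairs)
```

Under these assumptions, `representation_loss` involves no token sampling and never touches the vocabulary dimension, so repeated calls on the same rollout return the same value, whereas `sampled_kl_estimate` depends on which tokens the random generator draws.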

