2605.28396v1 May 27, 2026 cs.LG

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

Clive Bai
Saiyong Yang
Weijie Liu
Kun Liang
Chenming Tang
Yunfang Wu

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories. Standard full-rollout training, however, ties every update to a costly completion and can over-allocate supervision to late positions that have low marginal value for the current student. We revisit the assumption that full rollouts are necessary through the lens of the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision: it trains on short teacher-anchored prefixes, uses delayed full-rollout probes to audit prefix-full alignment, and adapts the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy-compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1× while achieving comparable or better accuracy.
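
The abstract gives only a high-level description of the horizon-adaptation loop, so the following is a minimal illustrative sketch and not the authors' algorithm. It assumes that prefix-full alignment is measured as the cosine similarity between a prefix-only update and a full-rollout update, that the horizon shrinks when a probe reports good alignment and grows otherwise, and that probes older than a fixed number of steps are discarded. The names `WindowController` and `cosine_alignment`, along with every threshold and default value, are hypothetical.

```python
import math
from dataclasses import dataclass


def cosine_alignment(prefix_update, full_update):
    """Cosine similarity between two flat update vectors (assumed alignment metric)."""
    dot = sum(p * f for p, f in zip(prefix_update, full_update))
    norm_p = math.sqrt(sum(p * p for p in prefix_update))
    norm_f = math.sqrt(sum(f * f for f in full_update))
    if norm_p == 0.0 or norm_f == 0.0:
        return 0.0  # treat a zero update as carrying no alignment evidence
    return dot / (norm_p * norm_f)


@dataclass
class WindowController:
    """Hypothetical horizon controller; all defaults are illustrative assumptions."""
    horizon: int = 256        # current prefix length in tokens
    min_horizon: int = 64
    max_horizon: int = 4096
    shrink: float = 0.5       # applied when the prefix is well aligned
    grow: float = 2.0         # applied when the prefix is misaligned
    threshold: float = 0.9    # assumed alignment threshold
    max_staleness: int = 4    # probes older than this many steps are ignored

    def update(self, alignment, probe_step, current_step):
        """Return the next horizon given a delayed full-rollout probe result."""
        if current_step - probe_step > self.max_staleness:
            return self.horizon  # stale probe: keep the current horizon
        factor = self.shrink if alignment >= self.threshold else self.grow
        proposed = int(self.horizon * factor)
        # Clamp so the horizon always stays within the configured bounds.
        self.horizon = max(self.min_horizon, min(self.max_horizon, proposed))
        return self.horizon
```

In a training loop under these assumptions, the student would generate `controller.horizon` tokens per rollout, and an occasional full-rollout probe, evaluated a few steps late, would feed its alignment score back through `update`.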
