2605.28396v1 May 27, 2026 cs.LG

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

Clive Bai

Saiyong Yang

Weijie Liu

Kun Liang

Chenming Tang

Yunfang Wu

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories. Standard full-rollout training, however, ties every update to a costly completion and can over-allocate supervision to late positions that have low marginal value for the current student. We revisit the full-rollout assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision. It trains on short teacher-anchored prefixes, uses delayed full-rollout probes to audit prefix-full alignment, and adapts the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy-compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1× while achieving comparable or better accuracy.
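The abstract does not specify how the horizon is adapted, so the following is only an illustrative sketch of the general idea of auditing prefix updates against delayed full-rollout probes. Everything in it is an assumption made for illustration: the `WindowController` class, the use of cosine similarity as the alignment measure, the halve-or-double rule, and all thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


def cosine(u, v):
    """Cosine similarity between two update directions given as float lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


@dataclass
class WindowController:
    # All defaults below are hypothetical values chosen for the example.
    window: int = 256            # current prefix length in tokens
    min_window: int = 64
    max_window: int = 4096
    align_threshold: float = 0.9
    max_staleness: int = 4       # probes older than this many steps are ignored

    def update(self, prefix_dir, full_dir, probe_step, current_step):
        """Adapt the next window from one delayed full-rollout probe."""
        # Staleness control: a probe computed too long ago is not trusted.
        if current_step - probe_step > self.max_staleness:
            return self.window
        if cosine(prefix_dir, full_dir) >= self.align_threshold:
            # Prefix update already matches the full-rollout direction:
            # assume a shorter window is admissible.
            self.window = max(self.min_window, self.window // 2)
        else:
            # Prefix update has drifted: lengthen the window.
            self.window = min(self.max_window, self.window * 2)
        return self.window
```

In this sketch, an aligned probe shrinks the window to save rollout compute, a misaligned probe grows it, and a stale probe leaves it unchanged. A real system would compare actual gradient or update statistics rather than the toy vectors used here.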
