2601.10348v1 Jan 15, 2026 cs.CL

Training-Trajectory-Aware Token Selection

Zenan Huang

Citations: 210

h-index: 8

Guoshan Lu

Citations: 311

h-index: 9

Wen-song Ye

Citations: 259

h-index: 7

Junbo Zhao

Citations: 180

h-index: 6

Junlin Zhou

Citations: 203

h-index: 5

Yihong Zhuang

Citations: 180

h-index: 6

Zeyu Qin

Citations: 158

h-index: 3

Zhanming Shen

Citations: 54

h-index: 2

Jiaqi Hu

Citations: 182

h-index: 5

Hao Chen

Citations: 1

h-index: 1

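Efficient distillation is a key way to turn expensive reasoning capability into deployable efficiency. At the frontier, however, where the student model already reasons well, naive continual distillation tends to bring limited gains or even degrade performance. The authors observe a distinctive training phenomenon: while the loss decreases monotonically, all performance metrics drop sharply at nearly the same bottleneck and then recover gradually. They trace this to a token-level mechanism. Token confidence splits into two groups: Imitation-Anchor Tokens, whose confidence rises steadily, and the remaining yet-to-learn tokens, which are not optimized and whose confidence is suppressed. The inability of these two token types to coexist is identified as the root cause of the failure of continual distillation. To address this, the authors propose Training-Trajectory-Aware Token Selection (T3S), which reconstructs the training objective at the token level to clear an optimization path for the yet-to-learn tokens. T3S delivers consistent gains in both autoregressive (AR) and diffusion language model (dLLM) settings. With only a few hundred examples, Qwen3-8B surpasses DeepSeek-R1, Qwen3-32B approaches Qwen3-235B, and a T3S-trained LLaDA-2.0-Mini exceeds its AR baseline, reaching the best performance among 16B-scale no-think models.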

Original Abstract

Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The fact that these two types of tokens cannot coexist is the root cause of the failure of continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3S-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.
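
The abstract does not spell out how tokens are selected, so the following is only a minimal sketch of the general idea of confidence-trajectory-based token selection, not the authors' T3S implementation. It assumes that per-token confidences have been logged at several checkpoints, that a token counts as an Imitation-Anchor Token when its confidence never decreases and rises by at least a hypothetical threshold `min_gain`, and that such tokens are simply masked out of the loss. The function names, the threshold, and the masking rule are all illustrative assumptions.

```python
from typing import List, Sequence


def is_anchor_token(confidences: Sequence[float], min_gain: float = 0.2) -> bool:
    """Assumed heuristic, not taken from the paper: a token is treated as an
    imitation anchor if its confidence never decreases across checkpoints
    and rises by at least `min_gain` overall."""
    steps = zip(confidences, confidences[1:])
    non_decreasing = all(later >= earlier for earlier, later in steps)
    total_gain = confidences[-1] - confidences[0]
    return non_decreasing and total_gain >= min_gain


def selection_mask(
    trajectories: Sequence[Sequence[float]], min_gain: float = 0.2
) -> List[int]:
    """Return 1 for tokens kept in the loss (yet-to-learn) and 0 for tokens
    dropped from it (assumed anchors)."""
    return [0 if is_anchor_token(t, min_gain) else 1 for t in trajectories]


def masked_nll(token_nll: Sequence[float], mask: Sequence[int]) -> float:
    """Average the per-token negative log-likelihood over kept tokens only."""
    kept = [loss for loss, keep in zip(token_nll, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical confidence trajectories for three tokens over three checkpoints.
trajectories = [
    [0.2, 0.5, 0.9],    # steadily rising: treated as an anchor, masked out
    [0.6, 0.4, 0.3],    # suppressed: kept as a yet-to-learn token
    [0.5, 0.55, 0.6],   # rising, but by less than min_gain: kept
]
mask = selection_mask(trajectories)
loss = masked_nll([0.1, 1.2, 0.8], mask)
```

In this toy setup only the first token is excluded, so the objective averages over the two tokens whose confidence has not yet taken off. A real implementation would apply the same kind of mask to the per-token loss tensor inside the training loop.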

