2605.07725v1 May 08, 2026 cs.CL

SOD: Step-wise On-policy Distillation for Small Language Model Agents

Junfeng Fang (Citations: 689, h-index: 14)

Mingyang Song (Citations: 225, h-index: 8)

Mao Zheng (Citations: 213, h-index: 8)

Xiang Wang (Citations: 403, h-index: 10)

Qiyong Zhong (Citations: 14, h-index: 2)

Xin Lin (Citations: 48, h-index: 5)

Jie Sun (Citations: 32, h-index: 3)

Houcheng Jiang (Citations: 385, h-index: 7)

Original Abstract

Tool-integrated reasoning (TIR) is difficult to scale to small language models because of instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods such as group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents that adaptively reweights distillation strength at each step based on step-level divergence. SOD can thus attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
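The abstract does not give the exact weighting rule, so the following is only a minimal sketch of the general idea of step-wise reweighting, not the authors' implementation. It assumes that a "step" is a contiguous span of student tokens, that step-level divergence is the mean per-token reverse KL between student and teacher, and that the weight is `exp(-divergence / tau)`. The function names, the `tau` parameter, and the toy distributions are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Dist = Sequence[float]  # probability distribution over a toy vocabulary


def reverse_kl(student: Dist, teacher: Dist, eps: float = 1e-12) -> float:
    """KL(student || teacher) at a single token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student, teacher)
        if s > 0.0
    )


def step_weights(step_divs: Sequence[float], tau: float = 1.0) -> List[float]:
    """Assumed weighting rule: exp(-d / tau), so divergent steps count less."""
    return [math.exp(-d / tau) for d in step_divs]


def stepwise_distill_loss(
    student: Sequence[Dist],
    teacher: Sequence[Dist],
    steps: Sequence[Tuple[int, int]],  # half-open token spans [start, end)
    tau: float = 1.0,
) -> Tuple[float, List[float], List[float]]:
    """Token-level distillation loss, reweighted once per step."""
    token_kl = [reverse_kl(s, t) for s, t in zip(student, teacher)]
    # Step-level divergence: mean token KL inside each step (an assumption).
    step_divs = [sum(token_kl[a:b]) / (b - a) for a, b in steps]
    # In a real training loop these weights would be treated as constants
    # (no gradient flowing through them); here everything is plain floats.
    weights = step_weights(step_divs, tau)
    total, n_tokens = 0.0, 0
    for (a, b), w in zip(steps, weights):
        total += w * sum(token_kl[a:b])
        n_tokens += b - a
    return total / n_tokens, step_divs, weights


# Toy trajectory: step 0 is well aligned with the teacher, step 1 follows a
# (hypothetical) bad tool call, so the student has drifted away from it.
student = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
teacher = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.05, 0.9, 0.05], [0.1, 0.8, 0.1]]
steps = [(0, 2), (2, 4)]

loss, step_divs, weights = stepwise_distill_loss(student, teacher, steps)
unweighted = sum(reverse_kl(s, t) for s, t in zip(student, teacher)) / len(student)
```

In this toy run the second step has the larger divergence and therefore the smaller weight, so the reweighted loss is lower than the plain token-averaged KL. This mirrors the stated goal of attenuating teacher signals in high-divergence regions while keeping dense guidance where student and teacher agree.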
