2605.15155v1 May 14, 2026 cs.LG

Self-Distilled Agentic Reinforcement Learning

Xunliang Cai (Citations: 74, h-index: 5)
Qi Gu (Citations: 82, h-index: 5)
Yueting Zhuang (Citations: 610, h-index: 14)
Yongliang Shen (Citations: 397, h-index: 10)
Jun Xiao (Citations: 249, h-index: 7)
Weiming Lu (Citations: 128, h-index: 5)
Zhuowen Han (Citations: 27, h-index: 3)
Zhengxi Lu (Citations: 290, h-index: 6)
Zhiyuan Yao (Citations: 34, h-index: 2)
Zihao Wang (Citations: 75, h-index: 3)
Jinyang Wu (Citations: 44, h-index: 3)

Original Abstract

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, since negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
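
The gating idea in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the abstract does not define the token-level signal, the distillation term, or any coefficients. The sketch assumes the signal is the teacher-minus-student log-probability gap on each sampled token, uses a gate-weighted negative student log-probability as a stand-in distillation term, and introduces the hypothetical names `gate_weights`, `gated_distill_loss`, `combined_loss`, `temperature`, and `aux_coef` purely for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_weights(teacher_logp, student_logp, temperature=1.0):
    """Hypothetical gate: sigmoid of the teacher-minus-student log-prob gap.

    A positive gap (teacher endorses the token more than the student does)
    gives a weight above 0.5; a negative gap (teacher rejection) is softly
    attenuated toward 0 instead of being cut off.
    """
    return [
        sigmoid((t - s) / temperature)
        for t, s in zip(teacher_logp, student_logp)
    ]


def gated_distill_loss(teacher_logp, student_logp, temperature=1.0):
    """Assumed auxiliary term: gate-weighted negative student log-prob.

    In an autodiff framework the gate would be computed under a
    stop-gradient (detached), so that only the student log-prob factor
    receives gradients. Here the values are plain floats.
    """
    if not student_logp:
        raise ValueError("need at least one token")
    weights = gate_weights(teacher_logp, student_logp, temperature)
    per_token = [-w * s for w, s in zip(weights, student_logp)]
    return sum(per_token) / len(per_token)


def combined_loss(rl_loss, teacher_logp, student_logp,
                  aux_coef=0.1, temperature=1.0):
    """RL loss stays primary; the gated term is a small auxiliary add-on."""
    aux = gated_distill_loss(teacher_logp, student_logp, temperature)
    return rl_loss + aux_coef * aux


if __name__ == "__main__":
    teacher = [-0.1, -3.0]   # teacher log-probs of the sampled tokens
    student = [-1.0, -0.5]   # student log-probs of the same tokens
    print(gate_weights(teacher, student))
    print(combined_loss(1.0, teacher, student))
```

In this sketch the first token (positive gap) receives a gate above 0.5 and the second (negative gap) a gate well below 0.5, which mirrors the asymmetric treatment described in the abstract; the actual signal and loss used by SDAR may differ.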
