2605.06387v1 May 07, 2026 cs.LG

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

Ke Zeng (Citations: 106, h-index: 6)
Xunliang Cai (Citations: 83, h-index: 5)
Zequn Sun (Citations: 467, h-index: 14)
Nan Jia (Citations: 5, h-index: 1)
Haojin Yang (Citations: 4, h-index: 2)
Xingchen Ma (Citations: 24, h-index: 2)
Jiesong Lian (Citations: 22, h-index: 2)
Weipeng Zhang (Citations: 43, h-index: 4)
Shuai Zhang (Citations: 5, h-index: 2)

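On-policy distillation (OPD) trains a student model on its own trajectories using token-level teacher feedback, and it often outperforms off-policy distillation and standard reinforcement learning. However, the advantage-weighted policy gradient used in standard OPD has three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. The authors therefore propose Asymmetric On-Policy Distillation (AOPD), which keeps positive reinforcement but replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions. In experiments on mathematical reasoning benchmarks, AOPD consistently outperforms standard OPD, with average gains of 4.09 and 8.34 under strong and weak initialization, respectively. AOPD also maintains higher policy entropy during training and retains capabilities better during sequential tool-use adaptation.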

Original Abstract

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
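The asymmetric split described in the abstract can be illustrated with a minimal sketch. This is not the paper's objective: the abstract does not define the advantage or the divergence, so the code assumes a token-level advantage equal to the teacher's log-probability minus the student's log-probability of the sampled token, and it assumes reverse KL (student relative to teacher) as the localized divergence. The function names `log_softmax` and `asymmetric_token_loss` are hypothetical, and only loss values are computed, not gradients.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def asymmetric_token_loss(student_logits, teacher_logits, token_id):
    """Return (branch, loss) for one sampled token position.

    ASSUMPTIONS (not taken from the paper):
      * advantage = log p_teacher(token) - log p_student(token)
      * the divergence in the non-positive region is reverse KL,
        KL(student || teacher), over the full vocabulary at this position
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    advantage = t_logp[token_id] - s_logp[token_id]

    if advantage > 0:
        # Positive region: keep the advantage-weighted policy-gradient
        # surrogate. In training, the advantage would be treated as a
        # constant weight on the student's log-probability.
        return "reinforce", -advantage * s_logp[token_id]

    # Non-positive region: instead of pushing the sampled token down,
    # pull the whole local distribution toward the teacher.
    kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
    return "divergence", kl


# Usage: the teacher prefers token 0, the student is uniform.
student = [0.0, 0.0]
teacher = [2.0, 0.0]
print(asymmetric_token_loss(student, teacher, 0))  # positive-advantage branch
print(asymmetric_token_loss(student, teacher, 1))  # non-positive-advantage branch
```

Under these assumptions, a sampled token that the teacher rates above the student is reinforced as in standard OPD, while a token with zero or negative advantage contributes a distribution-level term rather than a negative policy-gradient term.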

2 Citations

1 Influential

7 Altmetric

39.0 Score
