2605.01347v1 May 02, 2026 cs.CL

MAD-OPD: Improving Performance via Multi-Agent Debate-Based On-Policy Distillation

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

Jianze Wang

Citations: 779

h-index: 18

Yu Cao

Citations: 1

h-index: 1

Ying Liu

Citations: 26

h-index: 2

Jinlong Chen

Citations: 29

h-index: 2

Xuchun Hu

Citations: 1

h-index: 1

Qilong Zhang

Citations: 1

h-index: 1

Jun Wang

Citations: 27

h-index: 2

Huan Yang

Citations: 100

h-index: 3

Yong Xie

Citations: 4

h-index: 1

Qianglong Chen

Citations: 138

h-index: 4

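On-policy distillation (OPD) trains a student model on its own trajectories under token-level teacher supervision, but existing methods are limited by their reliance on a single teacher: when the teacher makes an error, the student learns that error too. OPD has also not been widely studied for agentic tasks, because per-step errors accumulate there, affect the entire long trajectory, and destabilize training. To overcome this limitation, we propose multi-agent debate-based on-policy distillation (MAD-OPD). MAD-OPD recasts the teacher in the distillation process as a set of teacher models that debate over the student's on-policy state. The debate gives rise to a collective intelligence that provides token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To apply OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error accumulation. In addition, we derive a task-adaptive divergence principle, choosing Jensen-Shannon divergence (JSD) for agentic stability and reverse Kullback-Leibler divergence for code generation, and verify it theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranked highest in every configuration. In particular, in the 14B+8B$\to$4B setting, MAD-OPD improves the agentic average by +2.4% and the code average by +3.7% over the stronger single-teacher OPD.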

Original Abstract

On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4\%$ and the code average by $+3.7\%$ over the stronger single-teacher OPD.
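The abstract names three ingredients without giving formulas: confidence-weighted teacher contributions, JSD, and reverse KL. The Python sketch below illustrates how these could fit together for a single token position. It is an assumption-laden illustration, not the paper's method: the mixing rule (normalizing confidences into weights) and all function names (`mix_teachers`, `reverse_kl`, `jsd`) are hypothetical, and the debate step that produces the confidences is not modeled at all.

```python
import math

def mix_teachers(teacher_probs, confidences):
    """Blend per-token teacher distributions using normalized confidences.

    Assumption: weights are confidences divided by their sum. The paper
    may aggregate teachers differently.
    """
    total = sum(confidences)
    weights = [c / total for c in confidences]
    vocab = len(teacher_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, teacher_probs))
            for i in range(vocab)]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(student, teacher):
    """Reverse KL in the distillation sense: KL(student || teacher)."""
    return kl(student, teacher)

def jsd(student, teacher):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    m = [0.5 * (s + t) for s, t in zip(student, teacher)]
    return 0.5 * kl(student, m) + 0.5 * kl(teacher, m)

# Toy example: two teachers that disagree, over a 3-token vocabulary.
teacher_a = [0.7, 0.2, 0.1]
teacher_b = [0.1, 0.8, 0.1]
confidences = [0.9, 0.3]      # hypothetical post-debate confidences
student = [0.5, 0.4, 0.1]

target = mix_teachers([teacher_a, teacher_b], confidences)
print("target:", target)
print("reverse KL:", reverse_kl(student, target))
print("JSD:", jsd(student, target))
```

JSD is bounded above by log 2, while reverse KL is unbounded. That boundedness is one plausible reading of why the abstract associates JSD with stability in long agentic trajectories, though the paper's own derivation should be consulted for the actual argument.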

1 Citation

0 Influential

9 Altmetric

46.0 Score

