2601.19921v1 Jan 09, 2026 cs.CL

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Caiqi Zhang

University of Cambridge

Xiaochen Zhu

Yizhou Chi

Tom Stafford

Nigel Collier

Andreas Vlachos

Abstract

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
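
The two interventions named in the abstract can be illustrated with a toy sketch. The abstract does not specify the paper's actual algorithms, so everything here is an assumption made for illustration: the function names (`majority_vote`, `diverse_init`, `debate_round`), the "prefer unseen answers" selection heuristic, and the rule that an agent switches only when the summed confidence of the other agents behind one answer exceeds its own confidence. Confidences are taken as given numbers in [0, 1] and are not recalibrated between rounds.

```python
from collections import defaultdict


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)


def diverse_init(candidates, k):
    """Pick k candidates, preferring answers not already in the pool.

    Hypothetical heuristic: it only raises the chance that the correct
    answer is present at the start of the debate.
    """
    chosen, seen = [], set()
    # First pass: take one candidate per distinct answer.
    for i, c in enumerate(candidates):
        if len(chosen) < k and c not in seen:
            chosen.append(i)
            seen.add(c)
    # Second pass: fill any remaining slots in the original order.
    for i in range(len(candidates)):
        if len(chosen) < k and i not in chosen:
            chosen.append(i)
    return [candidates[i] for i in sorted(chosen)]


def debate_round(answers, confidences):
    """One synchronous round of a confidence-conditioned update.

    Each agent adopts the answer with the highest total confidence among
    the other agents, but only if that total exceeds its own confidence.
    Assumes at least two agents; confidences are left unchanged.
    """
    updated = []
    for i, (own_answer, own_conf) in enumerate(zip(answers, confidences)):
        support = defaultdict(float)
        for j, (other_answer, other_conf) in enumerate(zip(answers, confidences)):
            if j != i:
                support[other_answer] += other_conf
        best = max(support, key=support.get)
        updated.append(best if support[best] > own_conf else own_answer)
    return updated


if __name__ == "__main__":
    # Five sampled answers, three debate slots: naive truncation keeps only "A".
    samples = ["A", "A", "A", "B", "C"]
    print(samples[:3], diverse_init(samples, 3))

    # Two low-confidence agents say "A"; one high-confidence agent says "B".
    answers, confidences = ["A", "A", "B"], [0.3, 0.3, 0.9]
    print(majority_vote(answers), debate_round(answers, confidences))
```

In this toy setting, majority vote over the initial answers returns the low-confidence majority answer, while one confidence-conditioned round moves every agent to the high-confidence answer. Whether that answer is correct depends entirely on how well calibrated the confidences are, which is why the abstract stresses calibration.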
