2602.23864v1 Feb 27, 2026 cs.AI

RUMAD: Reinforcement-Unifying Multi-Agent Debate

Wenbo Ding

Chao Wang

Han-Sheng Lin

Huaze Tang

Hui Lin

Abstract

Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static-topology methods cannot adapt to variations in task complexity, while external LLM-based coordination risks introducing privileged knowledge that compromises debate neutrality. This work presents RUMAD (Reinforcement-Unifying Multi-Agent Debate), a novel framework that formulates dynamic communication topology control in MAD as a reinforcement learning (RL) problem. RUMAD employs a content-agnostic observation scheme that captures high-level debate dynamics without accessing raw agent reasoning content. It uses a multi-objective reward to model solution quality, cohesion, and efficiency. A PPO-trained controller dynamically adjusts edge weights in the communication graph, while a dual-threshold mechanism enables fine-grained control over both agent activation and information visibility. Experimental evaluation across the MMLU, GSM8K, and GPQA benchmarks demonstrates that RUMAD achieves substantial efficiency gains, reducing token costs by over 80%, while still improving reasoning accuracy compared to a single LLM and multiple MAD baselines. Notably, RUMAD trained exclusively on MMLU exhibits robust zero-shot generalization to out-of-domain (OOD) tasks, indicating that the learned communication strategies capture task-independent principles of effective multi-agent coordination. These results establish RUMAD as an efficient and robust approach for deploying multi-agent reasoning applications under practical resource constraints.
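The abstract does not spell out how the dual-threshold mechanism turns controller-produced edge weights into debate decisions. The following is a minimal sketch of one plausible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the function name `apply_dual_threshold`, the parameters `act_threshold` and `vis_threshold`, the convention that `weights[i][j]` is the weight on the edge from agent `j` to agent `i`, and the rule that an agent is activated when at least one incoming edge reaches the activation threshold.

```python
from typing import List, Tuple


def apply_dual_threshold(
    weights: List[List[float]],
    act_threshold: float,
    vis_threshold: float,
) -> Tuple[List[bool], List[List[bool]]]:
    """Hypothetical dual-threshold gating over a weighted communication graph.

    weights[i][j] is assumed to be the controller's weight on the edge
    from agent j to agent i. Self-edges are ignored.
    Returns (active, visible), where active[i] says whether agent i
    responds in the next round and visible[i][j] says whether agent i
    is shown agent j's previous message.
    """
    n = len(weights)

    # Visibility (assumed rule): agent i sees agent j's message only if
    # the edge weight reaches the visibility threshold.
    visible = [
        [i != j and weights[i][j] >= vis_threshold for j in range(n)]
        for i in range(n)
    ]

    # Activation (assumed rule): agent i is queried again only if at
    # least one incoming edge reaches the activation threshold.
    # Inactive agents cost no new tokens in that round.
    active = [
        any(weights[i][j] >= act_threshold for j in range(n) if j != i)
        for i in range(n)
    ]
    return active, visible


# Example with three agents and made-up edge weights.
example_weights = [
    [0.0, 0.9, 0.2],
    [0.4, 0.0, 0.1],
    [0.6, 0.3, 0.0],
]
active, visible = apply_dual_threshold(example_weights, 0.5, 0.3)
print(active)   # which agents speak next round
print(visible)  # who sees whose previous message
```

Under these assumptions, the two thresholds act on different costs: the activation threshold limits how many agents generate a new response, while the visibility threshold limits how much prior context each active agent receives. Both would reduce token usage, which fits the efficiency claim in the abstract, though the actual rules in RUMAD may differ.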
