2603.21563v1 Mar 23, 2026 cs.AI

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Fuzhen Zhuang (Citations: 87, h-index: 6)

Y. Ban (Citations: 95, h-index: 6)

Yang Liu (Citations: 32, h-index: 4)

Wan Tian (Citations: 18, h-index: 3)

Zhongyi Li (Citations: 26, h-index: 3)

Huiming Zhang (Citations: 5, h-index: 2)

Jinjun Chen (Citations: 3, h-index: 1)

Original Abstract

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think-Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.
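To make the credit-assignment idea concrete, here is a minimal sketch for the multi-agent voting topology. It is not the authors' implementation, and the abstract does not give the exact estimator. The sketch assumes that "removing an agent's contribution" means dropping that agent's vote (leave-one-out), that a tied vote counts as no consensus and earns zero reward, and that global rollout statistics are tracked with a running mean and variance. All names (`plurality_vote`, `counterfactual_credits`, `RunningStats`) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


def plurality_vote(answers):
    """Return the strict plurality answer, or None on a tie or empty ballot."""
    top = Counter(answers).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # assumption: a tie means the team reaches no consensus
    return top[0][0]


def counterfactual_credits(answers, gold):
    """Credit for agent i = team reward minus reward with agent i's vote removed.

    This leave-one-out baseline is an illustrative stand-in for the
    counterfactual baselines described in the abstract.
    """
    team_reward = float(plurality_vote(answers) == gold)
    credits = []
    for i in range(len(answers)):
        without_i = answers[:i] + answers[i + 1:]
        baseline = float(plurality_vote(without_i) == gold)
        credits.append(team_reward - baseline)
    return team_reward, credits


@dataclass
class RunningStats:
    """Welford running mean/variance over every credit seen so far.

    Assumed stand-in for 'global-history-aware normalization': credits are
    scaled by statistics of the whole rollout history, not just one batch.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        if self.count < 2:
            return x - self.mean  # too little history to estimate a scale
        std = math.sqrt(self.m2 / self.count)
        return (x - self.mean) / (std + eps)


if __name__ == "__main__":
    # Two agents answer correctly, one is wrong; a shared reward would pay all three.
    team_reward, credits = counterfactual_credits(["42", "42", "7"], gold="42")
    stats = RunningStats()
    for c in credits:
        stats.update(c)
    print(team_reward, credits, [stats.normalize(c) for c in credits])
```

Under these assumptions the agent that voted "7" receives zero credit even though the team succeeds, which is the free-riding case a shared global reward cannot distinguish. A real training loop would feed the normalized credits into the policy-gradient update as per-agent advantages.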

3 Citations (1 Influential)

Altmetric: 26.4657359028

Score: 137.3
