2601.12518v1 Jan 18, 2026 cs.LG

Cooperative Multi-agent RL with Communication Constraints

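Cooperative multi-agent reinforcement learning (MARL) often assumes frequent access, through a data buffer, to global information such as team rewards or other agents' actions. In decentralized MARL systems this is usually unrealistic because communication is expensive. When communication is limited, agents cannot obtain fresh information and must rely on outdated data to estimate gradients and update their policies. A common way to handle missing data is importance sampling, which reweights old data collected under a base policy to estimate gradients for the current policy. When communication is limited (i.e., when the probability of missing data is high), however, the base policy becomes stale and importance sampling quickly becomes unstable. To address this, we propose a technique called base policy prediction, which uses old gradients to predict policy updates and collects samples for a sequence of predicted base policies, narrowing the gap between the base policy and the current policy. Because samples for the predicted base policies can be collected within a single communication round, this approach enables effective learning with far fewer communication rounds. Theoretically, we show that our algorithm converges to an $\varepsilon$-Nash equilibrium in potential games using only $O(\varepsilon^{-3/4})$ communication rounds and $O(poly(\max_i |A_i|)\varepsilon^{-11/4})$ samples. This improves on existing state-of-the-art results in both communication cost and sample complexity, and in particular avoids exponential dependence on the size of the joint action space. We also extend these results to general Markov cooperative games, where the algorithm finds an agent-wise local maximum. Empirically, we test the base policy prediction algorithm on simulated games and apply it to MAPPO (Multi-Agent Proximal Policy Optimization) in complex environments.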

Original Abstract

Cooperative MARL often assumes frequent access to global information in a data buffer, such as team rewards or other agents' actions, which is typically unrealistic in decentralized MARL systems due to high communication costs. When communication is limited, agents must rely on outdated information to estimate gradients and update their policies. A common approach to handle missing data is called importance sampling, in which we reweigh old data from a base policy to estimate gradients for the current policy. However, it quickly becomes unstable when the communication is limited (i.e. missing data probability is high), so that the base policy in importance sampling is outdated. To address this issue, we propose a technique called base policy prediction, which utilizes old gradients to predict the policy update and collect samples for a sequence of base policies, which reduces the gap between the base policy and the current policy. This approach enables effective learning with significantly fewer communication rounds, since the samples of predicted base policies could be collected within one communication round. Theoretically, we show that our algorithm converges to an $\varepsilon$-Nash equilibrium in potential games with only $O(\varepsilon^{-3/4})$ communication rounds and $O(poly(\max_i |A_i|)\varepsilon^{-11/4})$ samples, improving existing state-of-the-art results in communication cost, as well as sample complexity without the exponential dependence on the joint action space size. We also extend these results to general Markov Cooperative Games to find an agent-wise local maximum. Empirically, we test the base policy prediction algorithm in both simulated games and MAPPO for complex environments.
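The instability of importance sampling under a stale base policy, and the intuition behind predicting base policies from an old gradient, can be illustrated with a small numerical sketch. This is not the paper's algorithm: the softmax policy over three actions, the gradient vectors, the step size, and the helper names (`weight_second_moment`, `predicted_bases`) are all assumptions made for illustration. The sketch computes the exact second moment of the importance weight, $E_{a \sim \text{base}}[(\pi(a)/\pi_{\text{base}}(a))^2]$, which equals 1 when the two policies match and grows as they drift apart.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weight_second_moment(current, base):
    """Exact E_{a ~ base}[(current(a) / base(a))^2]; 1.0 iff the policies match."""
    return sum(c * c / b for c, b in zip(current, base))


def predicted_bases(theta0, old_grad, step_size, horizon):
    """Hypothetical helper: extrapolate parameters by reusing one stale gradient."""
    return [[t + k * step_size * g for t, g in zip(theta0, old_grad)]
            for k in range(horizon + 1)]


# Toy setup; every number here is an illustrative assumption.
theta0 = [0.0, 0.0, 0.0]
old_grad = [1.0, -0.5, -0.5]   # gradient known at the last communication round
true_grad = [0.9, -0.4, -0.5]  # direction the policy actually moves in
step_size, horizon = 0.5, 6

bases = predicted_bases(theta0, old_grad, step_size, horizon)
stale, predicted = [], []
theta = list(theta0)
for k in range(horizon + 1):
    current = softmax(theta)
    # Base policy frozen at the last communication round.
    stale.append(weight_second_moment(current, softmax(theta0)))
    # Base policy predicted for step k from the old gradient.
    predicted.append(weight_second_moment(current, softmax(bases[k])))
    theta = [t + step_size * g for t, g in zip(theta, true_grad)]

for k in range(horizon + 1):
    print(f"step {k}: stale base {stale[k]:.3f}, predicted base {predicted[k]:.3f}")
```

In this toy run the second moment under the frozen base policy grows with each update, while the one under the predicted base stays close to 1 as long as the old gradient remains a reasonable proxy for the true update direction. Since all predicted base policies are known in advance, their samples could in principle be gathered together, which is the intuition the abstract gives for needing fewer communication rounds.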
