2602.22786v1 Feb 26, 2026 cs.MA

QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Yuanjun Li (Citations: 18, h-index: 2)

Bin Zhang (Citations: 721, h-index: 13)

Hao Chen (Citations: 18, h-index: 3)

Zhouyang Jiang (Citations: 6, h-index: 1)

Dapeng Li (Citations: 334, h-index: 11)

Zhiwei Xu (Citations: 575, h-index: 11)

Abstract

Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
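The abstract does not spell out how the near-greedy joint action space or the similarity measure is defined, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a small tabular joint Q-function `q_next` (one axis per agent, equal action counts), treats "near-greedy" as joint actions that differ from the greedy one in at most `max_deviations` agents, measures similarity as the fraction of agents whose action matches the greedy choice, and turns similarities into weights with a softmax. The function names, `temperature`, and `max_deviations` are all hypothetical.

```python
import itertools
import numpy as np


def near_greedy_joint_actions(greedy, n_actions, max_deviations=1):
    """Joint actions differing from `greedy` in at most `max_deviations` agents.

    This construction is an assumption made for illustration only.
    """
    n_agents = len(greedy)
    candidates = []
    for joint in itertools.product(range(n_actions), repeat=n_agents):
        deviations = sum(a != g for a, g in zip(joint, greedy))
        if deviations <= max_deviations:
            candidates.append(joint)
    return candidates


def similarity_weighted_target(q_next, reward, gamma=0.99,
                               temperature=0.5, max_deviations=1):
    """Return (TD target, weights) using a similarity-weighted expectation.

    q_next: array of shape (n_actions,) * n_agents holding joint Q-values
            for the next state (a toy stand-in for a learned Q_tot).
    """
    n_agents = q_next.ndim
    n_actions = q_next.shape[0]

    # Greedy joint action, as used by the standard max-based target.
    greedy = tuple(int(a) for a in
                   np.unravel_index(np.argmax(q_next), q_next.shape))

    candidates = near_greedy_joint_actions(greedy, n_actions, max_deviations)

    # Assumed similarity: fraction of agents agreeing with the greedy action.
    sims = np.array([
        sum(a == g for a, g in zip(joint, greedy)) / n_agents
        for joint in candidates
    ])

    # Softmax over similarities: more similar actions get larger weights.
    logits = sims / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    q_vals = np.array([q_next[joint] for joint in candidates])
    target = reward + gamma * float(weights @ q_vals)
    return target, weights


# Toy comparison against the usual max-based target.
rng = np.random.default_rng(0)
q_toy = rng.normal(size=(3, 3))          # two agents, three actions each
soft_target, w = similarity_weighted_target(q_toy, reward=1.0)
max_target = 1.0 + 0.99 * float(q_toy.max())
print(soft_target <= max_target)
```

Because the weighted target is a convex combination of Q-values that are each no larger than the greedy one, it can never exceed the max-based target in this sketch; how much it is pulled down depends on `temperature` and on how the near-greedy set is built.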
