2604.19404v1 Apr 21, 2026 cs.RO

M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Yukai Feng (Citations: 41, h-index: 4)

Junwen Gu (Citations: 19, h-index: 2)

Junzhi Yu (Citations: 329, h-index: 10)

Zhi-zong Wu (Citations: 97, h-index: 4)

Zhengxing Wu (Citations: 3,414, h-index: 34)

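Existing policy learning methods for cooperative pursuit face fundamental difficulties when applied to biomimetic underwater robots, including long-horizon decision making, partial observability, and inter-robot coordination. To address these problems, this paper proposes Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO), a new framework that integrates a selective state-space Mamba policy with group relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy uses observation history to capture long-horizon temporal dependencies, encodes inter-robot interactions with attention-based relational features, and produces bounded continuous actions through normalized Gaussian sampling. To improve credit assignment without sacrificing stability, rewards are normalized across agents within each episode to obtain group-relative advantages, which are then optimized through a multi-agent extension of GRPO. This greatly reduces training resource requirements while enabling stable and scalable policy updates. In extensive simulations and real-world pool experiments across team sizes and evader strategies, M$^{2}$GRPO consistently outperformed MAPPO and recurrent baselines in pursuit success rate and capture efficiency. Overall, the proposed framework offers a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.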

Original Abstract

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M$^{2}$GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.
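The group-relative advantage described in the abstract can be illustrated with a minimal sketch. The abstract only states that rewards are normalized across agents within each episode. The mean/standard-deviation form, the PPO-style clipped objective, and the names `group_relative_advantages`, `clipped_surrogate`, `eps`, and `clip` are assumptions made here for illustration and are not taken from the paper.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(episode_returns, eps=1e-8):
    """Normalize per-agent episode returns against the group statistics.

    Assumed form: A_i = (R_i - mean(R)) / (std(R) + eps), where R holds
    one return per agent for a single episode.
    """
    mean = fmean(episode_returns)
    std = pstdev(episode_returns)  # population std over the agents
    return [(r - mean) / (std + eps) for r in episode_returns]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one action (assumed, not from the paper)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical returns for three pursuers in one episode.
returns = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(returns)
```

Because the baseline is the group mean, this kind of update needs no learned value function (critic), which is one plausible reading of the reduced training-resource demand the abstract mentions.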

