2604.14564v1 Apr 16, 2026 cs.AI

MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

Bowen Zhou (Citations: 2,572; h-index: 19)

Kaiyan Zhang (Citations: 1,911; h-index: 21)

Fang Li (Citations: 52; h-index: 4)

Pengfei Li (Citations: 47; h-index: 2)

Shijie Wang (Citations: 116; h-index: 1)

Yikun Fu (Citations: 34; h-index: 1)

Kaifeng Liu (Citations: 13; h-index: 1)

Dazhi Zhang (Citations: 11; h-index: 1)

Yuqiang Li (Citations: 17; h-index: 2)

Biqing Qi (Citations: 0; h-index: 0)

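Reinforcement learning (RL) has shown strong performance on reasoning-intensive tasks such as code generation. However, limited diversity in exploration trajectories caps the achievable gains. Search-enhanced methods mitigate this by introducing structured exploration, but they remain constrained by a single-agent policy. Using multiple interacting policies, by contrast, can yield more diverse exploration signals, yet existing approaches are generally decoupled from structured search. This paper proposes MARS² (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently optimized agents collaborate within a shared tree-structured search environment. MARS² models the search tree as a learnable multi-agent interaction environment, so that heterogeneous agents can jointly generate and refine candidate solutions within a shared search structure. To support effective learning, it introduces a path-level group advantage formulation based on tree-consistent reward shaping, which eases credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS² consistently improves performance across diverse model combinations and training settings, demonstrating the benefit of combining multi-agent collaboration with tree search to improve RL. The code is publicly available at https://github.com/TsinghuaC3I/MARTI.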

Original Abstract

Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, but it remains constrained by single-agent policy priors. Meanwhile, leveraging multiple interacting policies can yield more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS$^2$ (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently optimized agents collaborate within a shared tree-structured search environment. MARS$^2$ models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS$^2$ consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.
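
To make the idea of a path-level group advantage over a shared, multi-agent search tree more concrete, here is a minimal Python sketch. It is not the paper's formulation: the abstract does not specify the tree-consistent reward shaping, so the sketch substitutes a plain group normalization (path return minus the group mean, divided by the group standard deviation) and assumes that a path's return is simply the reward of its leaf. The `Node` class, the function names, and the per-agent averaging rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Node:
    """One candidate solution in the shared search tree (hypothetical structure)."""
    agent: str                  # which agent produced this candidate
    reward: float = 0.0         # e.g. fraction of unit tests passed (assumed)
    children: list["Node"] = field(default_factory=list)


def root_to_leaf_paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from root_to_leaf_paths(child, path)


def path_group_advantages(root, eps=1e-8):
    """Return (path, advantage) pairs with advantage = (R - mean) / (std + eps).

    Assumption: the return R of a path is the reward of its leaf node.
    """
    paths = list(root_to_leaf_paths(root))
    returns = [p[-1].reward for p in paths]
    mu, sigma = mean(returns), pstdev(returns)
    return [(p, (r - mu) / (sigma + eps)) for p, r in zip(paths, returns)]


def per_agent_credit(root):
    """Average, per agent, the advantages of all paths that visit its nodes."""
    credit = {}
    for path, adv in path_group_advantages(root):
        for node in path:
            credit.setdefault(node.agent, []).append(adv)
    return {agent: mean(vals) for agent, vals in credit.items()}


if __name__ == "__main__":
    # Agent "A" proposes a root candidate; "A" and "B" each refine it once.
    tree = Node("A", children=[Node("A", reward=1.0), Node("B", reward=0.0)])
    print(per_agent_credit(tree))
```

In this toy tree the refinement by agent "A" passes the assumed tests and the one by agent "B" does not, so paths ending in "A"'s leaf receive a positive advantage and "B" receives a negative one. A real training loop would feed such advantages into each agent's own policy update, which this sketch leaves out.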

Citations: 0

Influential citations: 0

Altmetric: 61.49

Score: 307.5
