2605.26646v1 May 26, 2026 cs.AI

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Haitao Li

Citations: 1,324

h-index: 17

Yan Gao

Citations: 44

h-index: 4

Lingyong Yan

Baidu Inc.

Citations: 1,503

h-index: 17

Yiqun T. Chen

Citations: 204

h-index: 7

Erhan Zhang

Citations: 91

h-index: 4

Jiaxin Mao

Citations: 159

h-index: 7

Xiaochi Wei

Citations: 11

h-index: 2

Jinyuan Feng

Citations: 29

h-index: 3

Rui Li

Citations: 7

h-index: 1

Zechun Niu

Citations: 24

h-index: 3

Yi Wu

Citations: 28

h-index: 4

Yao Hu

Citations: 16

h-index: 2

Shijie Wang

Citations: 151

h-index: 2

Wei Yang

Citations: 6

h-index: 1

Qi Liu

Citations: 145

h-index: 6

Bin Zhang

Citations: 87

h-index: 4

Biqing Qi

Citations: 930

h-index: 2

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated through prompts, tools, and control rules, and their agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow, rather than a single response or policy trajectory, as the optimization unit. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at the role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL optimization improves on the manually specified workflows, with especially large gains for smaller models and on strict all-passed code metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
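The decoupling of logical agents from physical models, and the three reward levels, can be illustrated with a small Python sketch. This is not the UnityMAS-O API: every name here (`Turn`, `sharing_mode`, `assemble_rewards`, `group_by_model`, the role and model labels) is hypothetical, and summing the role-, turn-, and trajectory-level rewards is an assumption made only for illustration, since the abstract does not say how the levels are combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Turn:
    """One step of a recorded workflow trajectory (hypothetical structure)."""
    role: str                 # logical agent role that produced this turn
    text: str                 # generated output (placeholder)
    turn_reward: float = 0.0  # reward attached to this specific turn


def sharing_mode(agent_to_model: dict[str, str]) -> str:
    """Classify an agent-to-model mapping by how parameters are shared."""
    n_agents = len(agent_to_model)
    n_models = len(set(agent_to_model.values()))
    if n_models == 1:
        return "full sharing"      # every role uses the same model
    if n_models == n_agents:
        return "full separation"   # every role has its own model
    return "partial sharing"       # some roles share, some do not


def assemble_rewards(turns, role_rewards, trajectory_reward):
    """Assumed combination rule: add the three reward levels per turn."""
    return [
        t.turn_reward + role_rewards.get(t.role, 0.0) + trajectory_reward
        for t in turns
    ]


def group_by_model(turns, totals, agent_to_model):
    """Route each (role, reward) pair to the model that backs that role."""
    batches = defaultdict(list)
    for turn, total in zip(turns, totals):
        batches[agent_to_model[turn.role]].append((turn.role, total))
    return dict(batches)


if __name__ == "__main__":
    mapping = {"retriever": "model_a", "reader": "model_a", "critic": "model_b"}
    turns = [
        Turn("retriever", "query", 0.25),
        Turn("reader", "answer", 0.0),
        Turn("critic", "feedback", 0.5),
    ]
    totals = assemble_rewards(turns, {"reader": 0.5}, trajectory_reward=1.0)
    print(sharing_mode(mapping))
    print(group_by_model(turns, totals, mapping))
```

In this sketch two roles share `model_a`, so their turns land in the same per-model batch, which loosely mirrors the idea of model-local worker groups receiving only the data for the parameters they update.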
