2605.26646v1 May 26, 2026 cs.AI

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Haitao Li
Yan Gao
Lingyong Yan (Baidu Inc.)
Yiqun T. Chen
Erhan Zhang
Jiaxin Mao
Xiaochi Wei
Jinyuan Feng
Rui Li
Zechun Niu
Yi Wu
Yao Hu
Shijie Wang
Wei Yang
Qi Liu
Bin Zhang
Biqing Qi

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most are still orchestrated manually through prompts, tools, and control rules, and their agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at the role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL optimization improves on the manually specified workflows, with especially large gains for smaller models and on the strict all-passed metric for code. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
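To make the decoupling of logical agents from physical model parameters more concrete, the following is a minimal Python sketch, not UnityMAS-O's actual API. Every name in it (`AgentRole`, `Turn`, `sharing_mode`, `assemble_rewards`, `group_by_model`) is hypothetical, and the additive combination of role-, turn-, and trajectory-level rewards is an assumption made for illustration; the abstract says only that rewards are assigned at those three levels, not how they are combined.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentRole:
    """A logical agent role (hypothetical structure, not the framework's class)."""
    name: str
    model_id: str  # identifier of the physical parameter set this role maps to


@dataclass
class Turn:
    """One node of a recorded graph trajectory (hypothetical structure)."""
    role: str
    output: str
    parents: list[int] = field(default_factory=list)  # indices of upstream turns
    turn_reward: float = 0.0


def sharing_mode(roles):
    """Classify an agent-model mapping by how many distinct models back the roles."""
    n_models = len({r.model_id for r in roles})
    if n_models == 1:
        return "full_sharing"
    if n_models == len(roles):
        return "full_separation"
    return "partial_sharing"


def assemble_rewards(turns, role_rewards, trajectory_reward):
    """Combine the three reward levels per turn.

    Assumption: the levels are simply summed; the source does not specify this.
    """
    return [
        t.turn_reward + role_rewards.get(t.role, 0.0) + trajectory_reward
        for t in turns
    ]


def group_by_model(roles, turns, rewards):
    """Route each (role, output, reward) sample to the model that backs its role."""
    role_to_model = {r.name: r.model_id for r in roles}
    batches = defaultdict(list)
    for turn, reward in zip(turns, rewards):
        batches[role_to_model[turn.role]].append((turn.role, turn.output, reward))
    return dict(batches)


# Illustrative workflow: two roles share model_a, a third role uses model_b.
roles = [
    AgentRole("retriever", "model_a"),
    AgentRole("reader", "model_a"),
    AgentRole("critic", "model_b"),
]
turns = [
    Turn("retriever", "search query", parents=[], turn_reward=0.25),
    Turn("reader", "draft answer", parents=[0]),
    Turn("critic", "verdict", parents=[1], turn_reward=0.5),
]
rewards = assemble_rewards(turns, role_rewards={"reader": 0.5}, trajectory_reward=1.0)
batches = group_by_model(roles, turns, rewards)
```

In this sketch, the samples produced by `retriever` and `reader` land in the same per-model batch, which is one way a partial-sharing configuration could feed model-local updates while the logical workflow keeps three distinct roles.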
