2605.14483v1 May 14, 2026 cs.AI

LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

Hua Wei

Citations: 29

h-index: 3

Xudong Chen

Citations: 28

h-index: 3

Yixin Liu

Citations: 50

h-index: 4

Kaize Ding

Citations: 202

h-index: 6

Abstract

Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks (MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval) show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.
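The training signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the function names, the sign convention of the contrast (original reward minus counterfactual reward), the additive combination, and the `weight` coefficient are all assumptions made for illustration. The sketch standardizes rewards within a sampled group, as GRPO-style objectives generally do, then adds the counterfactual reward contrast only to tokens inside the edited role, capacity, or dependency spans.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(
    num_tokens: int,
    sequence_advantage: float,
    edited_spans: List[Tuple[int, int]],  # half-open [start, end) token ranges
    reward_original: float,
    reward_counterfactual: float,
    weight: float = 1.0,  # hypothetical mixing coefficient, not from the paper
) -> List[float]:
    """Broadcast the sequence-level advantage to all tokens, then add the
    counterfactual reward contrast only on tokens inside the edited spans."""
    # Assumed sign convention: positive when the original field beats its edit.
    contrast = reward_original - reward_counterfactual
    adv = [sequence_advantage] * num_tokens
    for start, end in edited_spans:
        # Clamp each span to the valid token range before applying the bonus.
        for t in range(max(start, 0), min(end, num_tokens)):
            adv[t] += weight * contrast
    return adv


if __name__ == "__main__":
    # Four sampled orchestration specs with binary task rewards.
    group = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    # Suppose tokens 2..3 of the first spec encode a capacity field, and
    # editing that field drops the reward from 1.0 to 0.0.
    print(token_advantages(6, group[0], [(2, 4)], 1.0, 0.0, weight=0.5))
```

In this sketch, tokens outside the edited span receive only the group-level advantage, while the edited span receives an extra term proportional to how much the edit changed the outcome. That is one plausible way to give local orchestration decisions their own credit signal; the paper may combine the two terms differently.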

0 Citations

0 Influential

3 Altmetric

15.0 Score
