2603.07972v1 Mar 09, 2026 cs.AI

Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

Wei Yang

Citations: 20

h-index: 3

Muyan Weng

Citations: 11

h-index: 2

Defu Cao

Citations: 2,195

h-index: 12

Jiacheng Pang

Citations: 16

h-index: 2

Yan Liu

Citations: 265

h-index: 9

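Scaling individual large language models (LLMs) has delivered remarkable progress, but the next step lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS are "closed-world" systems bounded by the static knowledge of their pre-trained models; they readily make errors on tasks that require knowledge beyond the pre-training data and can fail collectively when facing novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that decides whether a problem can be solved autonomously or should be deferred to a human expert. To implement this policy, we introduce Dual-Loop Policy Optimization, which separates immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop uses continual learning to turn expert feedback into high-quality supervised signals that strengthen the agents' reasoning ability. In experiments on challenging mathematical and problem-solving benchmarks, HILA with Dual-Loop Policy Optimization consistently outperforms advanced MAS, establishing a principled foundation for collaborative, continually improving agentic systems.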

Original Abstract

While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
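
The abstract does not give the exact reward or update rule, so the following is only a rough sketch of how the two loops might fit together, not the authors' implementation. It assumes a reward of 1 for a correct answer minus a fixed penalty when the agent defers, and it uses the usual GRPO step of standardizing rewards within a group of rollouts sampled for the same problem. The names `DEFER_COST`, `cost_aware_reward`, `group_relative_advantages`, and `collect_supervised_examples`, and the penalty value 0.3, are all hypothetical.

```python
from statistics import mean, pstdev

DEFER_COST = 0.3  # hypothetical penalty for consulting the human expert


def cost_aware_reward(correct, deferred, defer_cost=DEFER_COST):
    """Assumed reward form: 1 for a correct outcome, minus a cost if deferred."""
    return (1.0 if correct else 0.0) - (defer_cost if deferred else 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style).

    Uses the population standard deviation as a simplification.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def collect_supervised_examples(records):
    """Outer-loop sketch: keep (problem, expert_solution) pairs from deferrals."""
    return [
        (rec["problem"], rec["expert_solution"])
        for rec in records
        if rec["deferred"] and rec.get("expert_solution")
    ]


# Inner loop: four rollouts for the same problem, as (correct, deferred) pairs.
group = [(False, False), (True, True), (True, False), (False, False)]
rewards = [cost_aware_reward(c, d) for c, d in group]
advantages = group_relative_advantages(rewards)

# Outer loop: expert feedback from deferred rollouts becomes supervised data.
records = [
    {"problem": "p1", "deferred": True, "expert_solution": "worked solution"},
    {"problem": "p2", "deferred": False, "expert_solution": None},
]
sft_pairs = collect_supervised_examples(records)
```

Under these assumptions, a rollout that solves the problem on its own gets the largest advantage. A correct answer obtained by deferring still ranks above a wrong autonomous attempt, so the penalty discourages unnecessary requests for help without ruling them out.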

0 Citations

0 Influential

6 Altmetric

30.0 Score
