2603.03680v1 Mar 04, 2026 cs.AI

MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation

Jiaxuan Gao (Citations: 900, h-index: 9)

Yi Wu (Citations: 704, h-index: 9)

Zelai Xu (Citations: 372, h-index: 7)

Luhao Yang (Citations: 51, h-index: 4)

Minyang Xie, Tsinghua University (Citations: 102, h-index: 1)

Zhao Shok (Citations: 0, h-index: 0)

Yu Wang (Citations: 67, h-index: 4)

Original Abstract

Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While in-context learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long-term improvement. Meta-Reinforcement Learning (meta-RL) provides an alternative by embedding the learning process directly within the model. However, existing meta-RL approaches for LLMs focus primarily on exploration in single-agent settings, neglecting the strategic exploitation necessary for multi-agent environments. We propose MAGE, a meta-RL framework that empowers LLM agents for strategic exploration and exploitation. MAGE uses a multi-episode training regime in which interaction histories and reflections are integrated into the context window. By using the final-episode reward as the objective, MAGE incentivizes the agent to refine its strategy based on past experience. We further combine population-based training with an agent-specific advantage normalization technique to enrich agent diversity and ensure stable learning. Experimental results show that MAGE outperforms existing baselines in both exploration and exploitation tasks. Furthermore, MAGE generalizes strongly to unseen opponents, suggesting it has internalized the ability for strategic exploration and exploitation. Code is available at https://github.com/Lu-Yang666/MAGE.
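
The abstract names two training ingredients without giving formulas: the final-episode reward as the objective, and an agent-specific advantage normalization. The Python sketch below shows one plausible reading, and it is not taken from the MAGE code. It assumes that each rollout is a sequence of episodes against one opponent from the population, that only the last episode's reward is used as the return, and that "agent-specific" means standardizing returns separately per opponent (a group-wise mean and standard deviation). The names `Rollout`, `final_episode_return`, and `agent_specific_advantages` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One multi-episode interaction against a single opponent (hypothetical structure)."""
    opponent_id: str              # which population member was faced
    episode_rewards: list[float]  # reward of each episode, in order


def final_episode_return(rollout: Rollout) -> float:
    """Assumed objective: only the last episode's reward counts."""
    return rollout.episode_rewards[-1]


def agent_specific_advantages(rollouts: list[Rollout], eps: float = 1e-6) -> list[float]:
    """Standardize final-episode returns separately for each opponent (assumed scheme)."""
    # Group returns by the opponent they were obtained against.
    by_opponent: dict[str, list[float]] = defaultdict(list)
    for r in rollouts:
        by_opponent[r.opponent_id].append(final_episode_return(r))

    # Per-opponent mean and population standard deviation.
    stats = {opp: (mean(vals), pstdev(vals)) for opp, vals in by_opponent.items()}

    # Advantage = (return - group mean) / (group std + eps), in input order.
    advantages = []
    for r in rollouts:
        mu, sigma = stats[r.opponent_id]
        advantages.append((final_episode_return(r) - mu) / (sigma + eps))
    return advantages
```

Under these assumptions, a rollout is credited relative to other rollouts against the same opponent, so a win against a strong opponent is not drowned out by easy wins against a weak one. Earlier episodes carry no direct reward in this sketch; they matter only through what the agent learns from them before the final episode.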
