2602.23008v1 Feb 26, 2026 cs.LG

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Xufang Luo

Beihang University

Citations: 1,750

h-index: 19

Dongsheng Li

Citations: 520

h-index: 7

Yuqing Yang

Citations: 33

h-index: 3

Jeonghye Kim

Citations: 88

h-index: 5

Zeyuan Liu

Citations: 29

h-index: 3

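Exploration remains a major challenge for large language model agents trained with reinforcement learning. Existing methods draw on pretrained knowledge but are less effective in environments that require discovering new states. This work proposes Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^2$), a hybrid reinforcement learning framework. EMPO$^2$ uses memory to drive exploration and combines on- and off-policy updates so that the model performs well with memory while remaining robust without it. On ScienceWorld and WebShop, EMPO$^2$ improves over GRPO by 128.6% and 11.3%, respectively. In tests on new environments, EMPO$^2$ also adapts to new tasks within a few trials using memory, with no parameter updates. These results suggest that EMPO$^2$ is a promising framework for building more exploratory and generalizable LLM-based agents.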

Original Abstract

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^2$), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^2$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.
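
To make the reported gains easier to read, the short sketch below converts a percentage improvement into a score ratio. It assumes the 128.6% and 11.3% figures are relative improvements over the GRPO score, which the abstract does not state explicitly. The function names and the baseline score of 30.0 are hypothetical and chosen only for illustration; they are not values from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage improvement of `new` over `baseline` (relative, not absolute)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


def score_ratio(improvement_pct: float) -> float:
    """Ratio new/baseline implied by a relative improvement given in percent."""
    return 1.0 + improvement_pct / 100.0


# Hypothetical baseline score, NOT taken from the paper.
hypothetical_grpo_score = 30.0

# Under the relative-improvement assumption, 128.6% means roughly 2.29x the baseline.
implied_score = hypothetical_grpo_score * score_ratio(128.6)

print(f"ratio for 128.6%: {score_ratio(128.6):.3f}")
print(f"ratio for 11.3%:  {score_ratio(11.3):.3f}")
print(f"implied score from a baseline of {hypothetical_grpo_score}: {implied_score:.2f}")
```

Read this way, a 128.6% improvement means the score more than doubles, while 11.3% is a modest gain over an already stronger baseline.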

5 Citations

0 Influential

9.5 Altmetric

52.5 Score
