2602.10090v2 Feb 10, 2026 cs.AI

Agent World Model: Infinite Synthetic Environments for Agentic Reinforcement Learning

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Siwei Han

Fudan University, University of North Carolina at Chapel Hill

Citations: 426

h-index: 10

Canwen Xu

Citations: 92

h-index: 4

Boyi Liu

Citations: 23

h-index: 2

Yite Wang

Citations: 57

h-index: 2

Yuxiong He

Citations: 1,644

h-index: 19

Zhaoyang Wang

Citations: 181

h-index: 6

Zhewei Yao

Citations: 124

h-index: 5

Huaxiu Yao

Citations: 107

h-index: 7

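Recent progress in large language models (LLMs) has enabled autonomous agents to carry out complex tasks that require multi-turn interaction with tools and environments. Scaling the training of such agents, however, is limited by the shortage of diverse and reliable environments. This paper proposes Agent World Model (AWM), a fully synthetic environment generation pipeline. With this pipeline, the authors scale to 1,000 environments covering everyday scenarios, in which agents interact with rich toolsets (35 tools per environment on average) and receive high-quality observations. Notably, the environments are code-driven and backed by databases, so they provide more reliable and consistent state transitions than environments simulated by an LLM. They also make agent interaction more efficient than collecting trajectories from real-world environments. Using this resource, the authors perform large-scale reinforcement learning for multi-turn tool-use agents. Because the environments are fully executable and their database state is accessible, reliable reward functions can be designed. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than in benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.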

Original Abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
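
To make the idea of a code-driven, database-backed environment with a state-based reward more concrete, here is a minimal Python sketch. It is not taken from the AWM codebase: the `TodoEnv` class, its two tools (`add_task`, `complete_task`), the `tasks` table, and the `reward` function are all hypothetical names chosen for illustration. The only point it tries to show is that when tools are ordinary code operating on a database, state transitions are deterministic and a reward can be computed by querying the final database state rather than by judging the agent's text.

```python
import sqlite3


class TodoEnv:
    """Hypothetical toy environment: tools are plain code, state lives in SQLite."""

    def __init__(self):
        # An in-memory database keeps every episode isolated and cheap to reset.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE tasks ("
            "id INTEGER PRIMARY KEY, title TEXT, done INTEGER DEFAULT 0)"
        )

    def add_task(self, title):
        # Tool 1: insert a row and return a structured observation.
        cur = self.db.execute("INSERT INTO tasks (title) VALUES (?)", (title,))
        self.db.commit()
        return {"id": cur.lastrowid, "title": title, "done": False}

    def complete_task(self, task_id):
        # Tool 2: update a row; report an error observation if it does not exist.
        cur = self.db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        self.db.commit()
        if cur.rowcount == 0:
            return {"error": f"no task with id {task_id}"}
        return {"id": task_id, "done": True}

    def call(self, tool, **kwargs):
        # Dispatch a tool call by name, as an agent loop might do.
        tools = {"add_task": self.add_task, "complete_task": self.complete_task}
        if tool not in tools:
            return {"error": f"unknown tool {tool}"}
        return tools[tool](**kwargs)


def reward(env, title):
    """Return 1.0 if a task with this title exists and is marked done, else 0.0.

    The check reads the database directly, so it does not depend on what
    the agent claims to have done.
    """
    row = env.db.execute(
        "SELECT done FROM tasks WHERE title = ?", (title,)
    ).fetchone()
    return 1.0 if row is not None and row[0] == 1 else 0.0


if __name__ == "__main__":
    env = TodoEnv()
    obs = env.call("add_task", title="buy milk")
    env.call("complete_task", task_id=obs["id"])
    print(reward(env, "buy milk"))
```

In this sketch a scripted sequence of calls stands in for the agent. In an actual training setup the calls would come from model outputs, and the reward check would presumably be specific to each task; how AWM generates its environments and reward functions is described in the paper and repository, not here.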

14 Citations

6 Influential

56.923984667453 Altmetric

310.6 Score
