2601.06328v1 Jan 09, 2026 cs.AI

ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation

Ziqiao Xi

Citations: 19

h-index: 1

Letian Peng

Citations: 174

h-index: 8

Fang Nan

Citations: 10

h-index: 1

Meshal Nayim

Citations: 1

h-index: 1

Tianhui Zhang

Citations: 65

h-index: 2

Rishika Mundada

Citations: 1

h-index: 1

Biwei Huang

Citations: 25

h-index: 3

Kun Zhou

Citations: 4

h-index: 1

Lianhui Qin

Citations: 40

h-index: 3

Qi Liu

Citations: 764

h-index: 11

Shuang Liang

Citations: 8

h-index: 2

Jiaqing Zhang

Citations: 14

h-index: 2

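Tool-using LLM agents still struggle in open-world settings that combine large tool pools, long-horizon objectives, complex constraints, and unreliable tool states. To support scalable, realistic training and testing, we introduce an open-world tool-using environment built on 5,571 format-unified tools across 204 commonly used apps. The environment includes a task creation engine that synthesizes long-horizon, multi-tool workflows with complex constraints, and a state controller that injects interruptions and failures to stress-test robustness. On top of this environment, we develop a tool select-then-execute agent framework whose planner-actor decomposition separates deliberate reasoning and self-correction from step-wise execution. A comprehensive evaluation of state-of-the-art LLMs reveals a misalignment between tool-planning and tool-execution abilities, a weakness of existing LLMs in following constraints, and that DeepSeek-v3.2 is the most robust of the models tested. Finally, we collect 1,170 trajectories from the environment to fine-tune LLMs and outperform baselines trained on 119k samples, which indicates the environment's value as both a realistic benchmark and a data engine for tool-using agents. Our code and data will be publicly released.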

Original Abstract

Tool-using LLM agents still struggle in open-world settings with large tool pools, long-horizon objectives, wild constraints, and unreliable tool states. For scalable and realistic training and testing, we introduce an open-world tool-using environment, built on 5,571 format unified tools across 204 commonly used apps. It includes a task creation engine that synthesizes long-horizon, multi-tool workflows with wild constraints, and a state controller that injects interruptions and failures to stress-test robustness. On top of this environment, we develop a tool select-then-execute agent framework with a planner-actor decomposition to separate deliberate reasoning and self-correction from step-wise execution. Comprehensive evaluation of state-of-the-art LLMs reveals the misalignment between tool planning and execution abilities, the constraint following weakness of existing LLMs, and DeepSeek-v3.2's strongest robustness. Finally, we collect 1,170 trajectories from our environment to fine-tune LLMs, achieving superior performance to baselines using 119k samples, indicating the environment's value as both a realistic benchmark and a data engine for tool-using agents. Our code and data will be publicly released.
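The abstract gives no implementation details, so the following is only a minimal sketch of how a failure-injecting state controller and a planner-actor loop could fit together. Every name here (`StateController`, `ToolFailure`, `plan`, `run_agent`) is hypothetical and not taken from the paper; the fixed two-step plan stands in for an LLM planner, and the simple retry stands in for the self-correction the abstract mentions.

```python
import random
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Step = Tuple[str, str, str]  # (tool name, status, detail)


class ToolFailure(Exception):
    """Raised when the hypothetical state controller injects a failure."""


class StateController:
    """Wraps tool calls and fails them with a fixed probability (assumed design)."""

    def __init__(self, failure_rate: float, seed: int = 0) -> None:
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so runs are reproducible
        self.injected = 0

    def call(self, tool: Tool, arg: str) -> str:
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ToolFailure("injected failure")
        return tool(arg)


def plan(task: str) -> List[Tuple[str, str]]:
    """Stand-in planner: a fixed two-step plan instead of an LLM call."""
    return [("search", task), ("summarize", task)]


def run_agent(
    task: str,
    tools: Dict[str, Tool],
    controller: StateController,
    max_retries: int = 3,
) -> List[Step]:
    """Actor loop: execute each planned step, retrying on injected failures."""
    trajectory: List[Step] = []
    for tool_name, arg in plan(task):
        for attempt in range(1, max_retries + 1):
            try:
                result = controller.call(tools[tool_name], arg)
                trajectory.append((tool_name, "ok", result))
                break
            except ToolFailure:
                trajectory.append((tool_name, "failed", f"attempt {attempt}"))
        else:
            # All retries failed: stop and return the partial trajectory.
            return trajectory
    return trajectory


# Toy tools used only for illustration.
TOOLS: Dict[str, Tool] = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda q: f"summary of {q}",
}

if __name__ == "__main__":
    for step in run_agent("flights to Seoul", TOOLS, StateController(0.3)):
        print(step)
```

Recording failed attempts in the trajectory alongside successful ones is a design choice of this sketch: it keeps the recovery behavior visible, which is the kind of signal a robustness stress test would need to inspect.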

1 Citations

0 Influential

5.5 Altmetric

28.5 Score
