2602.10999v1 Feb 11, 2026 cs.AI

CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

Feiyang Pan (Citations: 37, h-index: 4)
Dandan Tu (Citations: 50, h-index: 3)
Yusong Lin (Citations: 36, h-index: 3)
Haiyang Wang (Citations: 20, h-index: 2)
Shuzhe Wu (Citations: 18, h-index: 2)
Lue Fan (Citations: 2,528, h-index: 21)
Sanyuan Zhao (Citations: 1,438, h-index: 14)

Abstract

Agentic coding requires agents to interact effectively with runtime environments, such as command-line interfaces (CLIs), to complete tasks like resolving dependency issues and fixing system problems. However, how to obtain such environment-intensive tasks at scale to enhance agents' capabilities remains underexplored. To address this, we draw on an analogy between a Dockerfile and an agentic task and propose employing agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one that exhibits runtime failures, and a task can then be derived by packaging the buggy state with the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, the largest collection of its kind. Moreover, our model LiberCoder, fine-tuned on curated successful trajectories, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
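The inversion idea in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline, which uses agents and real runtime environments. Here the environment is modeled as a plain dictionary of installed packages, the history as a list of Dockerfile-like steps, and execution feedback as a simple check function. All names (`Step`, `Task`, `install`, `replay`, `run_check`, `invert`) and the package versions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical toy model of an environment: package name -> installed version.
State = dict[str, str]


@dataclass
class Step:
    """One entry in the environment history, analogous to a Dockerfile line."""
    command: str
    apply: Callable[[State], State]


@dataclass
class Task:
    """A derived task: the buggy state plus the error it produces."""
    buggy_state: State
    error_message: str
    undone_steps: list[str]  # commands that were rolled back (a reference fix)


def install(pkg: str, version: str) -> Step:
    """Build a history step that installs one package (illustrative only)."""
    return Step(f"pip install {pkg}=={version}", lambda s: {**s, pkg: version})


def replay(history: list[Step], upto: int) -> State:
    """Rebuild the environment state after the first `upto` history steps."""
    state: State = {}
    for step in history[:upto]:
        state = step.apply(state)
    return state


def run_check(state: State) -> Optional[str]:
    """Stand-in for execution feedback: an error message, or None if healthy."""
    for pkg in ("numpy", "pandas"):
        if pkg not in state:
            return f"ModuleNotFoundError: No module named '{pkg}'"
    return None


def invert(history: list[Step]) -> Optional[Task]:
    """Walk back from a healthy state to the most recent failing prefix."""
    if run_check(replay(history, len(history))) is not None:
        return None  # final environment is not healthy, nothing to invert
    for upto in range(len(history) - 1, -1, -1):
        state = replay(history, upto)
        error = run_check(state)
        if error is not None:
            return Task(state, error, [s.command for s in history[upto:]])
    return None


if __name__ == "__main__":
    history = [
        install("numpy", "1.26.4"),
        install("pandas", "2.2.2"),
        install("requests", "2.32.3"),
    ]
    print(invert(history))
```

In this toy version, inversion simply truncates a known history. The abstract instead describes agents that simulate and explore histories under execution feedback, so the sketch only captures the final packaging step: a buggy state paired with its error message.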

7 Citations

1 Influential

10.5 Altmetric

61.5 Score
