2602.00592v1 Jan 31, 2026 cs.AI

DockSmith: Scaling Reliable Coding Environments via an Agentic Docker Builder

Luck Ma (Citations: 15, h-index: 1)
Yanhao Li (Citations: 560, h-index: 4)
Fanqi Wan (Citations: 50, h-index: 3)
Mengqiang Ren (Citations: 38, h-index: 2)
Zhewei Huang (Citations: 95, h-index: 4)
Jiaran Zhang (Citations: 99, h-index: 6)
Di Qi (Citations: 120, h-index: 4)
Xuefeng Zhao (Citations: 76, h-index: 4)
Jieyi Hou (Citations: 46, h-index: 1)
Xin Wu (Citations: 79, h-index: 3)
Liangyu Chen (Citations: 326, h-index: 6)
Yingwei Ma (Citations: 220, h-index: 2)
Qi Han (Citations: 649, h-index: 6)
Xiangyun Zhang (Citations: 56, h-index: 4)
Zhe Xie (Citations: 356, h-index: 12)

Original Abstract

Reliable Docker-based environment construction is a dominant bottleneck for scaling execution-grounded training and evaluation of software engineering agents. We introduce DockSmith, a specialized agentic Docker builder designed to address this challenge. DockSmith treats environment construction not only as a preprocessing step, but as a core agentic capability that exercises long-horizon tool use, dependency reasoning, and failure recovery, yielding supervision that transfers beyond Docker building itself. DockSmith is trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory-style pipeline augmented with a loop-detection controller and a cross-task success memory. Training a 30B-A3B model on these trajectories achieves open-source state-of-the-art performance on Multi-Docker-Eval, with 39.72% Fail-to-Pass and 58.28% Commit Rate. Moreover, DockSmith improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, demonstrating broader agentic benefits of environment construction.
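The abstract names a loop-detection controller but does not say how it works. As a rough illustration only, one plausible reading is a guard that watches the agent's recent steps and raises a flag when the same command keeps producing the same result. The class name `LoopDetector`, the `window` and `max_repeats` parameters, and the sample trajectory below are all hypothetical and are not taken from the paper.

```python
from collections import deque


class LoopDetector:
    """Flags an agent that keeps repeating the same (action, result) step.

    Hypothetical sketch; this is not the DockSmith controller.
    """

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        # Sliding window over the most recent steps; older steps fall out.
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, action: str, result: str) -> bool:
        """Record one step and report whether it now looks like a loop.

        Returns True when the step just recorded occurs at least
        max_repeats times inside the current window.
        """
        step = (action.strip(), result.strip())
        self.recent.append(step)
        return self.recent.count(step) >= self.max_repeats


# Made-up trajectory: the same failing build is retried three times.
detector = LoopDetector(window=6, max_repeats=3)
trajectory = [
    ("docker build .", "error: package libfoo not found"),
    ("edit Dockerfile", "ok"),
    ("docker build .", "error: package libfoo not found"),
    ("docker build .", "error: package libfoo not found"),
]
flags = [detector.observe(action, result) for action, result in trajectory]
print(flags)  # [False, False, False, True]
```

In a real pipeline, a raised flag would presumably trigger some intervention, such as forcing a different strategy or ending the episode. What DockSmith actually does at that point is not stated in the abstract.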

1 Citations

0 Influential

6 Altmetric

31.0 Score
