2601.22859v2 Jan 30, 2026 cs.SE

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Haifeng Wang (Citations: 209, h-index: 6)
Qingfu Zhu (Citations: 553, h-index: 11)
Wanxiang Che (Citations: 357, h-index: 10)
Bingjin Chen (Citations: 13, h-index: 2)
Hua Wu (Citations: 6, h-index: 1)
Chuanzhe Guo (Citations: 0, h-index: 0)
Jingjing Wu (Citations: 10, h-index: 1)
Sijun He (Citations: 9, h-index: 1)
Yang Chen (Citations: 1, h-index: 1)
Zhaoqi Kuang (Citations: 1, h-index: 1)
Shilong Fan (Citations: 163, h-index: 2)
Siqi Bao (Citations: 18, h-index: 2)
Jing Liu (Citations: 141, h-index: 6)

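Progress on large language model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, which stems from the complexity of building executable environments across diverse languages. To address this, we introduce MEnvAgent, a multi-language framework for automated environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent uses a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures, and it integrates a novel environment reuse mechanism that reduces computational overhead by incrementally patching historical environments. Evaluated on MEnvBench, a new benchmark of 1,000 tasks across 10 languages, MEnvAgent outperforms baselines, improving the Fail-to-Pass (F2P) rate by 8.6% and reducing time cost by 43%. We also use MEnvAgent to build MEnvData-SWE, the largest open-source polyglot dataset of realistic, verifiable Docker environments to date, which enables consistent performance gains on SWE tasks across a range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.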

Original Abstract

The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.
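The Fail-to-Pass (F2P) rate reported in the abstract can be illustrated with a minimal sketch. It assumes the common convention that a task instance counts as verified when at least one test fails before the reference fix and passes after it; the abstract does not spell out the exact criterion, and the function names, status labels, and data layout below are hypothetical rather than taken from the MEnvAgent code.

```python
from typing import Dict, List

# Hypothetical status labels; real test harnesses report results in their own formats.
PASS = "pass"
FAIL = "fail"


def fail_to_pass_tests(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return the tests that fail before the fix and pass after it, sorted by name."""
    return sorted(
        name
        for name, status in before.items()
        if status == FAIL and after.get(name) == PASS
    )


def f2p_rate(instances: List[Dict[str, Dict[str, str]]]) -> float:
    """Fraction of instances with at least one Fail-to-Pass test (assumed criterion)."""
    if not instances:
        return 0.0
    verified = sum(
        1 for inst in instances if fail_to_pass_tests(inst["before"], inst["after"])
    )
    return verified / len(instances)


# Two illustrative instances: the first has a Fail-to-Pass test, the second does not.
example_instances = [
    {"before": {"test_a": FAIL, "test_b": PASS}, "after": {"test_a": PASS, "test_b": PASS}},
    {"before": {"test_a": FAIL}, "after": {"test_a": FAIL}},
]
example_rate = f2p_rate(example_instances)
```

Under this assumed criterion, an environment in which the tests cannot run at all yields no Fail-to-Pass test, so a build failure lowers the rate in the same way as an ineffective fix.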

0 Citations

0 Influential

39.666066720281 Altmetric

198.3 Score
