2602.09447v2 Feb 10, 2026 cs.SE

SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

Authors:

Mingkun Xiao (citations: 16, h-index: 2)
H. Shum (citations: 8,547, h-index: 34)
Zhirui Zhang (citations: 4, h-index: 1)
Hongbo Zhang (citations: 2, h-index: 1)
Haoxiang Fei (citations: 2, h-index: 1)
Zhiyuan Bao (citations: 5, h-index: 1)
Yubin Chen (citations: 117, h-index: 7)
Zhengyu Lei (citations: 6, h-index: 1)
Ziyue Liu (citations: 50, h-index: 4)
Yixuan Sun (citations: 39, h-index: 2)
Zihan Ye (citations: 25, h-index: 3)
Yu Zhang (citations: 53, h-index: 4)
Hongcheng Zhu (citations: 47, h-index: 1)
Yuxiang Wen (citations: 1, h-index: 1)

Abstract

Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
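The solve rates quoted above follow directly from the task counts. As a minimal sketch, the percentages can be reproduced from the solved/total figures given in the abstract. The function name `solve_rate` and the dictionary layout are illustrative assumptions, not part of the benchmark's tooling.

```python
# Illustrative only: reproduces the percentages quoted in the abstract
# from the solved/total counts. All names here are hypothetical.
TOTAL_TASKS = 22

# Solved-task counts as reported in the abstract.
solved = {
    "gpt-5.3-codex": 19,
    "claude-opus-4.6": 15,
}


def solve_rate(n_solved: int, n_total: int = TOTAL_TASKS) -> float:
    """Return the solve rate as a percentage rounded to one decimal place."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return round(100.0 * n_solved / n_total, 1)


rates = {model: solve_rate(n) for model, n in solved.items()}

for model, rate in rates.items():
    print(f"{model}: {solved[model]}/{TOTAL_TASKS} = {rate}%")
```

No count is given in the abstract for kimi-2.5, so it is left out of the sketch.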

Citations: 1 (0 influential); Altmetric: 17; Score: 86.0
