2605.03546v1 May 05, 2026 cs.SE

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Sten Sootla

Citations: 19,726

h-index: 12

Gabriel Synnaeve

Citations: 63,500

h-index: 58

John Yang

Citations: 214

h-index: 4

Diyi Yang

Citations: 435

h-index: 6

K. Lieret

Citations: 344

h-index: 4

Parth Thakkar

Citations: 120

h-index: 5

D. Pedchenko

Citations: 7

h-index: 1

Emily McMilin

Citations: 2,312

h-index: 7

Pengcheng Yin

Citations: 174

h-index: 5

R. Hou

Citations: 263

h-index: 9

Ofir Press

University of Washington

Citations: 12,089

h-index: 21

J. Ma

Citations: 25

h-index: 2

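Language models are widely used to turn ideas into complete software projects from scratch. Agents are used to initialize, maintain, and extend codebases with minimal human intervention. In these settings, models must make high-level software architecture decisions. Existing benchmarks, however, focus on narrow tasks such as fixing a single bug or developing a specific feature. We therefore introduce ProgramBench to evaluate the software development ability of software engineering agents holistically. In ProgramBench, given only a program and its documentation, an agent must design and implement a codebase that matches the behavior of the reference executable. End-to-end behavioral tests are generated through agent-driven fuzzing, so implementations can be evaluated without prescribing their structure. Our 200 tasks range from simple CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. In an evaluation of 9 language models, no model fully resolved any task, and even the best-performing model passed 95% of the tests on only 3% of the tasks. Models tended to favor monolithic, single-file implementations that differ sharply from human-written code.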

Original Abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
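The abstract describes evaluation by matching a reference executable's behavior and reports results as the share of tasks on which a model passes at least 95% of tests. The sketch below illustrates one way such a comparison could be computed. It is not the authors' harness: the names `Observation`, `run`, `pass_rate`, and `fraction_of_tasks_at_threshold` are hypothetical, and comparing only standard output and exit code is an assumption, since the abstract does not say which observable behavior is checked.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Observable behavior of one run (assumed: stdout and exit code only)."""
    stdout: bytes
    exit_code: int


def run(command, stdin=b""):
    """Run a command with the given stdin and record what it did."""
    proc = subprocess.run(command, input=stdin, capture_output=True, timeout=10)
    return Observation(proc.stdout, proc.returncode)


def pass_rate(reference_cmd, candidate_cmd, test_inputs):
    """Fraction of test inputs on which the candidate matches the reference.

    Each test input is a pair (extra_args, stdin_bytes).
    """
    if not test_inputs:
        return 0.0
    passed = sum(
        run(reference_cmd + list(args), stdin) == run(candidate_cmd + list(args), stdin)
        for args, stdin in test_inputs
    )
    return passed / len(test_inputs)


def fraction_of_tasks_at_threshold(rates, threshold=0.95):
    """Share of tasks whose pass rate reaches the threshold."""
    if not rates:
        return 0.0
    return sum(rate >= threshold for rate in rates) / len(rates)


if __name__ == "__main__":
    # Toy stand-ins for a reference executable and a candidate reimplementation.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read())"]
    inputs = [([], b"abc"), ([], b"ABC")]
    print(pass_rate(reference, candidate, inputs))
```

Because only the externally visible behavior is compared, a candidate is free to use any internal structure, which is the property the abstract attributes to end-to-end behavioral tests.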

6 Citations

1 Influential

29 Altmetric

153.0 Score
