2602.01655v2 Feb 02, 2026 cs.AI

ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Wei Wang (Citations: 0, h-index: 0)

Lyumanshan Ye (Citations: 308, h-index: 7)

Chao Huang (Citations: 420, h-index: 6)

Pengrui Lu (Citations: 234, h-index: 4)

Shiqi Zhang (Citations: 305, h-index: 4)

Yunzhong Hou (Citations: 5, h-index: 1)

Ji Zeng (Citations: 25, h-index: 3)

Pengfei Liu (Citations: 289, h-index: 6)

Mingchao Yang (Citations: 69, h-index: 5)

Zixin Chen (Citations: 87, h-index: 4)

Hantao Jiang (Citations: 33, h-index: 4)

Original Abstract

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at https://github.com/zsworld6/projdevbench.
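The acceptance rate quoted in the abstract is an aggregate over Online Judge verdicts. The abstract does not say exactly how submissions are counted or grouped, so the following is only a minimal sketch of how such a rate could be computed. The `Submission` record, the agent and category labels, and the sample verdicts are all hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    agent: str     # hypothetical agent label
    category: str  # hypothetical problem category
    verdict: str   # OJ-style verdict string, e.g. "AC", "WA", "TLE", "MLE"


def acceptance_rate(submissions):
    """Return the fraction of submissions with verdict "AC" (0.0 if empty)."""
    subs = list(submissions)
    if not subs:
        return 0.0
    return sum(s.verdict == "AC" for s in subs) / len(subs)


def acceptance_by_category(submissions):
    """Group submissions by category and compute the acceptance rate of each."""
    groups = defaultdict(list)
    for s in submissions:
        groups[s.category].append(s)
    return {cat: acceptance_rate(group) for cat, group in groups.items()}


# Made-up records for illustration only; NOT data from the paper.
SAMPLE = [
    Submission("agent-a", "data-structures", "AC"),
    Submission("agent-a", "system-design", "TLE"),
    Submission("agent-b", "data-structures", "AC"),
    Submission("agent-b", "system-design", "WA"),
]

if __name__ == "__main__":
    print(f"overall: {acceptance_rate(SAMPLE):.2%}")
    print(acceptance_by_category(SAMPLE))
```

Grouping by category in this way is one possible route to the kind of breakdown the abstract describes, where basic data-structure tasks are accepted more often than tasks that stress system design or resource limits.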

4 Citations

0 Influential

36.324746787308 Altmetric

185.6 Score
