2603.13428v1 Mar 13, 2026 cs.SE

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

Zhongming Yu

Citations: 122

h-index: 4

Yuhong Liu

Citations: 284

h-index: 3

Yuxin Yang

Citations: 18

h-index: 2

Gangda Deng

Citations: 72

h-index: 3

Viktor K. Prasanna

Citations: 59

h-index: 2

Zhaoling Chen

Citations: 81

h-index: 3

Haoyang Fan

Citations: 14

h-index: 3

Dhruv Parikh

Citations: 68

h-index: 6

Rajgopal Kannan

Citations: 196

h-index: 8

Le Cong

Citations: 453

h-index: 10

Mengdi Wang

Citations: 614

h-index: 11

Qian Zhang

Citations: 12

h-index: 1

Xiangru Tang

Citations: 19

h-index: 4

Xingyao Wang

Citations: 164

h-index: 4

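As AI agents are increasingly deployed as long-running systems, they must autonomously build and continuously evolve customized software to interact within dynamic environments. Existing benchmarks, however, evaluate agents on isolated, one-off coding tasks and overlook the temporal dependencies and technical debt inherent in real-world software evolution. To close this gap, we introduce DeepCommit, an agentic pipeline that reconstructs Milestone DAGs (directed acyclic graphs) from noisy commit logs, where a milestone denotes a semantically cohesive development goal. These executable sequences enable EvoClaw, a new benchmark that requires agents to maintain system integrity and limit error accumulation, aspects of long-term software evolution that current benchmarks largely lack. Evaluating 12 frontier models across 4 agent frameworks reveals a critical weakness: overall performance scores exceed 80% on isolated tasks but fall to at most 38% in continuous settings, showing how severely agents struggle with long-term maintenance and error propagation.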

Original Abstract

With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline that reconstructs verifiable Milestone DAGs from noisy commit logs, where milestones are defined as semantically cohesive development goals. These executable sequences enable EvoClaw, a novel benchmark that requires agents to sustain system integrity and limit error accumulation, dimensions of long-term software evolution largely missing from current benchmarks. Our evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly from >80% on isolated tasks to at most 38% in continuous settings, exposing agents' profound struggle with long-term maintenance and error propagation.
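
To make the gap between isolated and continuous evaluation concrete, here is a minimal toy sketch in Python. It is not the paper's pipeline or scoring method: the milestone names, the pass/fail values, and the rule that a milestone only counts as resolved when all of its prerequisites were resolved are illustrative assumptions. It only shows how a single early failure can propagate through a milestone DAG when the codebase carries forward between tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical milestone DAG: each milestone maps to the set of milestones
# it depends on. These names are invented for illustration only.
milestones = {
    "m1_parser": set(),
    "m2_cache": {"m1_parser"},
    "m3_cli": {"m1_parser"},
    "m4_release": {"m2_cache", "m3_cli"},
}

def continuous_run(dag, passes):
    """Visit milestones in dependency order.

    Assumed toy rule: a milestone is resolved only if its own attempt
    passes AND every prerequisite milestone was resolved before it.
    """
    resolved = {}
    for name in TopologicalSorter(dag).static_order():
        deps_ok = all(resolved[dep] for dep in dag[name])
        resolved[name] = passes[name] and deps_ok
    return resolved

# Assumed per-milestone outcomes when each task is attempted in isolation
# (i.e. starting from a correct codebase every time).
isolated = {
    "m1_parser": True,
    "m2_cache": False,
    "m3_cli": True,
    "m4_release": True,
}

result = continuous_run(milestones, isolated)
isolated_rate = sum(isolated.values()) / len(isolated)
continuous_rate = sum(result.values()) / len(result)
print(f"isolated: {isolated_rate:.2f}, continuous: {continuous_rate:.2f}")
```

In this toy setup the failure at `m2_cache` also invalidates `m4_release`, so the continuous rate falls below the isolated rate even though no additional task failed on its own.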

1 Citations

0 Influential

5.5 Altmetric

28.5 Score
