2602.07900v1 Feb 08, 2026 cs.SE

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Yuling Shi

Citations: 398

h-index: 11

Chao Peng

Citations: 502

h-index: 10

Zhi Chen

Citations: 52

h-index: 4

Zhensu Sun

Citations: 536

h-index: 11

Xiaodong Gu

Citations: 217

h-index: 7

David Lo

Citations: 13

h-index: 2

Lingxiao Jiang

Citations: 292

h-index: 6

Abstract

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, a paradigm adopted by many high-ranking agents on the SWE-bench leaderboard. However, we observe that GPT-5.2, which writes almost no new tests, achieves performance comparable to top-ranking agents. This raises a critical question: do such tests meaningfully improve issue resolution, or do they merely mimic human testing practices while consuming a substantial interaction budget? To reveal the impact of agent-written tests, we present an empirical study that analyzes agent trajectories across six state-of-the-art LLMs on SWE-bench Verified. Our results show that while test writing is commonly adopted, resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. Furthermore, these tests typically serve as observational feedback channels, where agents prefer value-revealing print statements significantly more than formal assertion-based checks. Based on these insights, we perform a controlled experiment by revising the prompts of four agents to either increase or reduce test writing. The results suggest that changes in the volume of agent-written tests do not significantly change final outcomes. Taken together, our study reveals that current test-writing practices may provide marginal utility in autonomous software engineering tasks.
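
The distinction the abstract draws between value-revealing print statements and assertion-based checks can be illustrated with a minimal sketch. This is not the paper's analysis pipeline. The function name `count_feedback_styles` and both sample scripts are hypothetical, and the sketch assumes that agent-written tests are plain Python scripts. It counts only bare `print(...)` calls and `assert` statements, so framework helpers such as `self.assertEqual` are not counted.

```python
import ast


def count_feedback_styles(source: str) -> dict:
    """Count print() calls and assert statements in a Python script.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = {"print_calls": 0, "assert_statements": 0}
    for node in ast.walk(ast.parse(source)):
        # A bare call to the name `print`, e.g. print("value:", x)
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            counts["print_calls"] += 1
        # A plain `assert <condition>` statement
        elif isinstance(node, ast.Assert):
            counts["assert_statements"] += 1
    return counts


# Hypothetical script that only reveals values for the agent to read.
observational_script = '''
result = sorted([3, 1, 2])
print("sorted:", result)
print("length:", len(result))
'''

# Hypothetical script that checks the result against an expected value.
assertion_script = '''
result = sorted([3, 1, 2])
assert result == [1, 2, 3]
'''

if __name__ == "__main__":
    print(count_feedback_styles(observational_script))
    print(count_feedback_styles(assertion_script))
```

The first script leaves the judgment to whoever reads the output, while the second encodes the expected result and fails on its own when the result is wrong.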

3 Citations

1 Influential

5.5 Altmetric

32.5 Score
