2602.08316v1 Feb 09, 2026 cs.SE

SWE Context Bench: A Benchmark for Context Learning in Coding

Jared Zhu (citations: 7, h-index: 1)

Minhao Hu (citations: 21, h-index: 2)

Junde Wu (citations: 934, h-index: 9)

Original Abstract

Large language models are increasingly used as programming agents for repository-level software engineering tasks. While recent benchmarks evaluate correctness in realistic codebases, they largely treat tasks as independent and do not assess whether agents can reuse experience across related problems. As a result, the ability of agents to accumulate, retrieve, and apply prior experience, as well as the efficiency gains from such reuse, remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, SWE-ContextBench augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests, forming task sequences with shared context. The benchmark evaluates agents along three complementary dimensions: prediction accuracy, time efficiency, and cost efficiency. Using SWE-ContextBench, we study multiple experience reuse settings, including oracle-guided and autonomous retrieval, as well as full execution trajectories and compact summaries. Our results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected experience provides limited or negative benefits. These findings highlight the importance of experience representation and retrieval quality, and position SWE-ContextBench as a principled benchmark for studying experience reuse in programming agents.
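The three evaluation dimensions named in the abstract (accuracy, time, cost) could be aggregated roughly as in the sketch below. This is an illustration only: the abstract does not give the benchmark's metric definitions, so the `TaskRun` record, its field names, and the `summarize` and `compare` helpers are all hypothetical, and simple means are assumed as the aggregation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """One agent attempt at one task (hypothetical record, not the paper's schema)."""
    task_id: str
    resolved: bool      # whether the attempt resolved the task
    runtime_s: float    # wall-clock time of the attempt, in seconds
    tokens: int         # total tokens consumed by the attempt


def summarize(runs):
    """Aggregate runs into one figure per dimension (assumed: plain means)."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "resolve_rate": sum(r.resolved for r in runs) / len(runs),  # accuracy
        "mean_runtime_s": mean(r.runtime_s for r in runs),          # time
        "mean_tokens": mean(r.tokens for r in runs),                # cost
    }


def compare(baseline, with_experience):
    """Per-metric difference: with_experience minus baseline."""
    base = summarize(baseline)
    reuse = summarize(with_experience)
    return {name: reuse[name] - base[name] for name in base}
```

Calling `compare` on two lists of `TaskRun` records for the same tasks, one without and one with reused experience, yields a positive `resolve_rate` delta when reuse helps accuracy and negative runtime and token deltas when it saves time and cost.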

6 citations (0 influential); Altmetric 4.5; score 28.5
