2603.16124v1 Mar 17, 2026 cs.SE

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Ping Nie (Citations: 394; h-index: 11)
Songcheng Cai (Citations: 14; h-index: 2)
Z. Lyu (Citations: 399; h-index: 6)
Yuansheng Ni, University of Waterloo (Citations: 4,568; h-index: 11)
Xiang Chen (Citations: 71; h-index: 3)
Baichuan Zhou (Citations: 193; h-index: 2)
Shenzhe Zhu (Citations: 85; h-index: 4)
Haozhe Wang (Citations: 153; h-index: 4)
Chi Ruan (Citations: 26; h-index: 2)
Benjamin Schneider (Citations: 51; h-index: 4)
Weixu Zhang (Citations: 193; h-index: 3)
Xiang Li (Citations: 19; h-index: 2)
Andrew T. Zheng (Citations: 2; h-index: 1)
Yuyu Zhang (Citations: 79; h-index: 3)
Wenhu Chen (Citations: 29; h-index: 2)
Yifu Lu (Citations: 195; h-index: 4)

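Agentic repository-level code understanding is essential for automating complex software engineering tasks, but the field lacks reliable benchmarks. Existing evaluations often overlook important long-tail topics and rely on popular repositories whose questions large language models (LLMs) can answer from memorized knowledge. To address these problems, we introduce SWE-QA-Pro, a benchmark built from diverse, long-tail repositories with executable environments. We balance topics through issue-driven clustering to cover under-represented task types, and we apply a rigorous difficulty calibration process that filters out questions solvable by direct answering. As a result, agentic workflows perform far better than direct answering (for example, a gap of about 13 points for Claude Sonnet 4.5), which demonstrates the need for agentic codebase exploration. To address the shortage of training data for such complex behaviors, we also propose a scalable synthetic data pipeline that supports a two-stage training recipe: supervised fine-tuning (SFT) followed by reinforcement learning from AI feedback (RLAIF). This approach lets small open-source models learn efficient tool use and reasoning. In experiments, a Qwen3-8B model trained with our recipe outperforms GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of the agentic training workflow.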

Original Abstract

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
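The difficulty calibration step described in the abstract (dropping questions that a direct-answer baseline can already solve) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Question` schema, the `calibrate_difficulty` function, the scoring callback, and the `solved_threshold` value of 0.8 are all hypothetical names and assumptions chosen for illustration, and the abstract does not specify how "solvable" is judged.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Question:
    """One candidate benchmark item (hypothetical schema, not from the paper)."""
    qid: str
    text: str
    reference_answer: str


def calibrate_difficulty(
    questions: Iterable[Question],
    direct_answer: Callable[[Question], str],
    score: Callable[[str, str], float],
    solved_threshold: float = 0.8,  # assumed cutoff; the paper's criterion is not given here
) -> List[Question]:
    """Keep only questions that the direct-answer baseline fails to solve."""
    kept: List[Question] = []
    for q in questions:
        prediction = direct_answer(q)  # baseline answers without exploring the repository
        if score(prediction, q.reference_answer) < solved_threshold:
            kept.append(q)  # not solved directly, so it stays in the benchmark
    return kept


# Toy stand-ins for a model call and a grader, used only to exercise the filter.
def toy_baseline(q: Question) -> str:
    return "42" if q.qid == "easy" else "unknown"


def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0


candidates = [
    Question("easy", "What value does the function return?", "42"),
    Question("hard", "Where is the retry policy configured?", "in the client module"),
]
kept = calibrate_difficulty(candidates, toy_baseline, exact_match)
kept_ids = [q.qid for q in kept]
print(kept_ids)
```

In a real setting the two callbacks would wrap an LLM queried without tool access and a grading procedure; passing them in as parameters keeps the filter independent of any particular model or judge.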

Citations: 1; Influential: 0; Altmetric: 5.5; Score: 28.5
