2605.04615v2 May 06, 2026 cs.SE

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Ziyin Zhang (Citations: 404, h-index: 9)
Zihan Liao (Citations: 52, h-index: 5)
Siqiao Xue (Citations: 938, h-index: 15)
Jin Qin (Citations: 8, h-index: 2)
Yixiang Mu (Citations: 26, h-index: 3)
Fan Zhou (Citations: 115, h-index: 6)
Hang Yu (Citations: 81, h-index: 3)

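Code search has usually been evaluated as first-stage retrieval, but real systems rely on broader pipelines that include reranking and developer-style queries. Existing benchmarks suffer from data contamination, label noise, and degenerate binary relevance. This paper introduces CoREB, a contamination-limited, multitask benchmark for code retrieval and reranking. CoREB goes beyond retrieval to cover the full code search pipeline. It is built from counterfactually rewritten LiveCodeBench problems in five programming languages and is delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers on three tasks (text-to-code, code-to-text, and code-to-code). Our experiments show that: 1) code-specialised embeddings outperform general encoders by about 2× on code-to-code retrieval, yet no single model performs best on all three tasks; 2) short keyword queries, the format closest to real developer search, drive every model's nDCG@10 to near zero; 3) off-the-shelf rerankers behave differently depending on the task, with a 12-point swing on code-to-code, and no baseline is net-positive across all tasks; 4) our fine-tuned CoREB-Reranker is the first model to achieve consistent gains across all three tasks. The data and model are released.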

Original Abstract

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
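The abstract reports results as nDCG@10 over graded relevance judgments. As a rough illustration of what that metric measures, here is a minimal sketch in Python. It assumes the linear-gain form of DCG (gain equals the relevance grade); the abstract does not say which gain function or grading scale the benchmark uses, so the function names and the example grades below are hypothetical.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k graded relevance values.

    Assumption: linear gain (the grade itself), discounted by log2(rank + 1)
    with ranks starting at 1. Some evaluations use 2**grade - 1 instead.
    """
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_gains=None, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_gains: relevance grades in the order the system returned documents.
    all_gains: grades of every judged document for the query (optional);
               defaults to ranked_gains when the full pool is not supplied.
    """
    pool = all_gains if all_gains is not None else ranked_gains
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_gains, k) / ideal_dcg

# Hypothetical grades on a 0-3 scale for one query.
perfect = ndcg_at_k([3, 2, 1, 0])   # best document first
swapped = ndcg_at_k([0, 3])         # the only relevant hit is ranked second
```

With graded judgments, a ranking is rewarded for placing highly relevant items above partially relevant ones, which a binary relevant/not-relevant label cannot express.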

1 Citation

0 Influential

7.5 Altmetric

38.5 Score
