2601.08778v3 Jan 13, 2026 cs.AI

Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards

Tengjun Jin (Citations: 68, h-index: 4)
Yoojin Choi (Citations: 9, h-index: 2)
Yuxuan Zhu (Citations: 162, h-index: 6)
Daniel Kang (Citations: 165, h-index: 6)

Abstract

Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of data-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. We show that performance changes range from -7% to 31% (in relative terms) and rank changes range from $-9$ to $+9$ positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman's $r_s$=0.85, $p$=3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman's $r_s$=0.32, $p$=0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at https://github.com/uiuc-kang-lab/text_to_sql_benchmarks.
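The three quantities reported in the abstract (relative performance change, rank change, and Spearman's rank correlation) can be illustrated with a minimal sketch. The agent names and accuracy values below are hypothetical toy numbers, not results from the paper, and the sign convention for rank change (positive means the agent moved up after correction) is an assumption. The sketch computes only the correlation coefficient $r_s$, not the $p$-value.

```python
def ranks(scores):
    """Rank scores so that 1 is the best (highest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's r_s: the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def relative_change(original, corrected):
    """Relative performance change, e.g. 0.2 means +20%."""
    return (corrected - original) / original


# Hypothetical accuracies (%) on the original and corrected subsets.
original = {"agent_a": 60.0, "agent_b": 55.0, "agent_c": 50.0, "agent_d": 45.0}
corrected = {"agent_a": 57.0, "agent_b": 66.0, "agent_c": 52.0, "agent_d": 54.0}

names = list(original)
orig_scores = [original[n] for n in names]
corr_scores = [corrected[n] for n in names]

orig_ranks = ranks(orig_scores)
corr_ranks = ranks(corr_scores)

# Assumed convention: positive rank change = agent moved up after correction.
rank_change = {n: int(o - c) for n, o, c in zip(names, orig_ranks, corr_ranks)}
rel_change = {n: relative_change(original[n], corrected[n]) for n in names}
r_s = spearman(orig_scores, corr_scores)

for n in names:
    print(f"{n}: relative change {rel_change[n]:+.1%}, rank change {rank_change[n]:+d}")
print(f"Spearman r_s between original and corrected rankings: {r_s:.2f}")
```

In this toy setup, a low $r_s$ between the original and corrected rankings would indicate that correcting annotations reshuffles the leaderboard, which is the kind of effect the abstract describes.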
