2604.17771v1 Apr 20, 2026 cs.CL

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

Hitesh Laxmichand Patel

Oracle

Dan Roth

Mohammad Safarzadeh

Afshin Orojlooyjadid

Graham Horwood

Original Abstract

Large language models (LLMs) have achieved strong performance on natural language to SQL (NL2SQL) benchmarks, yet their reported accuracy may be inflated by contamination from benchmark queries or structurally similar patterns seen during training. We introduce SPENCE (Syntactic Probing and Evaluation of NL2SQL Contamination Effects), a controlled syntactic probing framework for detecting and quantifying such contamination. SPENCE systematically generates syntactic variants of test queries for four widely used NL2SQL datasets: Spider, SParC, CoSQL, and the newer BIRD benchmark. We use SPENCE to evaluate multiple high-capacity LLMs under execution-based scoring. For each model, we measure changes in execution accuracy across increasing levels of syntactic divergence and quantify rank sensitivity using Kendall's tau with bootstrap confidence intervals. By aligning these robustness trends with benchmark release dates, we observe a clear temporal gradient: older benchmarks such as Spider exhibit the strongest negative values and thus the highest likelihood of training leakage, whereas the more recent BIRD dataset shows minimal sensitivity and appears largely uncontaminated. Together, these findings highlight the importance of temporally contextualized, syntactic-probing evaluation for trustworthy NL2SQL benchmarking.
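
The abstract does not say exactly how the rank statistic is computed, so the following is only a minimal sketch of one plausible reading: correlate the syntactic divergence level with execution accuracy at that level using Kendall's tau-b, and obtain a percentile bootstrap interval by resampling queries. The function names (`kendall_tau_b`, `accuracy_by_level`, `bootstrap_tau`), the data layout, and the toy outcomes are all assumptions made for illustration; they are not taken from the paper and are not results.

```python
import random
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


def accuracy_by_level(outcomes, idx):
    """Execution accuracy per divergence level over the query indices in idx."""
    idx = list(idx)
    return [sum(level[i] for i in idx) / len(idx) for level in outcomes]


def bootstrap_tau(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for tau(level, accuracy).

    outcomes[k][i] is 1 if query i executed correctly at divergence level k,
    else 0 (hypothetical layout; the same queries appear at every level).
    """
    rng = random.Random(seed)
    n = len(outcomes[0])
    levels = list(range(len(outcomes)))
    point = kendall_tau_b(levels, accuracy_by_level(outcomes, range(n)))
    taus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries
        taus.append(kendall_tau_b(levels, accuracy_by_level(outcomes, idx)))
    taus.sort()
    lo = taus[int((alpha / 2) * n_boot)]
    hi = taus[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, (lo, hi)


# Invented toy data: 4 divergence levels x 8 queries (not from the paper).
toy_outcomes = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # level 0: original phrasing
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],  # level 3: most divergent variant
]
tau, (ci_low, ci_high) = bootstrap_tau(toy_outcomes)
print(f"tau = {tau:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Under this reading, a strongly negative tau means accuracy falls steadily as variants move away from the original benchmark phrasing, which is the pattern the abstract associates with likely leakage; a tau near zero means accuracy is largely insensitive to syntactic divergence.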
