2601.18119v1 Jan 26, 2026 cs.AI

Beyond Text-to-SQL: Can LLMs Really Debug Enterprise ETL SQL?

Jing Ye (Citations: 0, h-index: 0)
Yonghong Yu (Citations: 3, h-index: 1)
Victor Ma (Citations: 8, h-index: 1)
Yiwen Duan (Citations: 3, h-index: 1)
Yang Gao (Citations: 2,779, h-index: 5)
Xing Chen (Citations: 5, h-index: 1)

Original Abstract

SQL is central to enterprise data engineering, yet generating fully correct SQL code in a single attempt remains difficult, even for experienced developers and advanced text-to-SQL LLMs, often requiring multiple debugging iterations. We introduce OurBench, the first benchmark for enterprise-level SQL reasoning and debugging. Our benchmark is built on two key innovations: (1) an automated construction workflow that uses reverse engineering to systematically inject realistic bugs into large-scale SQL code, enabling scalable and diverse benchmark generation; and (2) an execution-free evaluation framework tailored to enterprise settings, providing fast, accurate, and resource-efficient assessment. OurBench comprises 469 OurBenchSyn queries featuring syntax errors with explicit error messages, and 516 OurBenchSem queries targeting semantic errors in which the code fails to meet user intent. The queries are highly complex, averaging over 140 lines and featuring deep and wide abstract syntax trees. Evaluation of nearly 30 LLMs reveals a substantial performance gap: the best-performing model, Claude-4-Sonnet, achieves only 36.46 percent accuracy on OurBenchSyn and 32.17 percent on OurBenchSem, while most models score below 20 percent. We further explore four solution strategies, identify key challenges, and outline promising directions for enterprise SQL debugging with LLMs.
