2603.29399v1 Mar 31, 2026 cs.AI

ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Tengjun Jin (citations: 68, h-index: 4)

Yotam Perlitz (citations: 292, h-index: 9)

Christopher Zanoli (citations: 3, h-index: 1)

Andrea Giovannini (citations: 24, h-index: 2)

A. Klimovic (citations: 574, h-index: 12)

Abstract

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
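The abstract reports inter-annotator agreement as Fleiss' kappa = 0.85. For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss' kappa is computed from categorical labels when every item is labeled by the same number of annotators. The function name, the label names (`benchmark_error`, `agent_error`), and the toy ratings are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items, where each item is a list of category labels,
    one label per rater. Every item must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: fraction of rater pairs that chose the same label.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    expected = sum(p * p for p in shares)

    if expected == 1.0:
        # Only one category was ever used; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: three annotators label four failed tasks.
toy_ratings = [
    ["benchmark_error", "benchmark_error", "agent_error"],
    ["agent_error", "agent_error", "agent_error"],
    ["benchmark_error", "benchmark_error", "benchmark_error"],
    ["benchmark_error", "agent_error", "agent_error"],
]
print(round(fleiss_kappa(toy_ratings), 3))
```

Kappa is 1 when all annotators agree on every item and near 0 when agreement is no better than chance, so a value of 0.85 indicates agreement well above chance.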

Metrics: 0 citations, 0 influential, 6 Altmetric, 30.0 score
