2605.02503v1 May 04, 2026 cs.AI

DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis

Weihua Zheng (Citations: 23, h-index: 3)

Jialong Chen (Citations: 79, h-index: 5)

Chuan Chen (Citations: 55, h-index: 4)

Weihao Ye (Citations: 7, h-index: 2)

Boyuan Li (Citations: 0, h-index: 0)

Bowen Deng (Citations: 29, h-index: 4)

Zibin Zheng (Citations: 402, h-index: 9)

Qiao Zhang (Citations: 2, h-index: 1)

Yi Luo (Citations: 1,367, h-index: 13)

Jian Lin (Citations: 5, h-index: 2)

Abstract

Evaluating autonomous data-analysis agents requires testing their ability to perform exploratory analysis in underexplored data environments. However, many existing benchmarks emphasize final-answer accuracy in prior-guided data settings and provide limited support for evaluating the reasoning process. We introduce DataClaw, a process-oriented benchmark for exploratory real-world data analysis. DataClaw contains approximately 2.06 million real-world records across enterprise, industry, and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones for process-level evaluation. These annotations allow DataClaw to measure how far an agent progresses and where its reasoning breaks down. Experiments with eight advanced LLMs show that current agents remain far from reliable in this setting, with seven models achieving below 50% overall accuracy. Process analysis further reveals partial progress hidden behind wrong answers and distinct exploration strategies across models. Overall, DataClaw provides a less data-constrained diagnostic testbed for probing the capability boundaries of autonomous data-analysis agents.
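To make the idea of milestone-based process evaluation concrete, here is a minimal sketch in Python. It is not DataClaw's actual scoring code: the `TaskResult` record, the per-milestone boolean flags, and the three summary statistics are all assumptions chosen for illustration. It shows how per-task milestone annotations could expose both how far an agent got and where it first went wrong, even when the final answer is incorrect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    # Hypothetical record (not from the paper): one flag per annotated
    # milestone, in task order, plus the final-answer judgment.
    milestones_hit: list[bool]
    final_correct: bool


def progress(result: TaskResult) -> float:
    """Fraction of milestones reached (0.0 when a task has none)."""
    if not result.milestones_hit:
        return 0.0
    return sum(result.milestones_hit) / len(result.milestones_hit)


def first_breakdown(result: TaskResult) -> Optional[int]:
    """Index of the first missed milestone, or None if all were reached."""
    for i, hit in enumerate(result.milestones_hit):
        if not hit:
            return i
    return None


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Final-answer accuracy, mean progress, and mean progress on wrong answers."""
    n = len(results)
    wrong = [r for r in results if not r.final_correct]
    return {
        "accuracy": sum(r.final_correct for r in results) / n if n else 0.0,
        "mean_progress": sum(progress(r) for r in results) / n if n else 0.0,
        "mean_progress_when_wrong": (
            sum(progress(r) for r in wrong) / len(wrong) if wrong else 0.0
        ),
    }


if __name__ == "__main__":
    demo = [
        TaskResult([True, True, False], final_correct=False),
        TaskResult([True, True], final_correct=True),
    ]
    print(summarize(demo))
    print(first_breakdown(demo[0]))
```

In this sketch, a task with a wrong final answer still contributes a nonzero progress value, which is the kind of "partial progress hidden behind wrong answers" that answer-only accuracy cannot show.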
