2602.24288v1 Feb 27, 2026 cs.AI

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Boyi Liu (Citations: 12; h-index: 2)

Yite Wang (Citations: 38; h-index: 2)

Yuxiong He (Citations: 1,532; h-index: 18)

Feng Yan (Citations: 1,070; h-index: 3)

Fanyuan Shu (Citations: 0; h-index: 0)

Ruofan Wu (Citations: 12; h-index: 2)

Zhewei Yao (Citations: 338; h-index: 9)

Original Abstract

The fast-growing demand for using Large Language Models (LLMs) to tackle complex multi-step data science tasks creates an urgent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially on machine learning modeling tasks. Fine-tuning on DARE-bench training tasks can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x, and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These significant improvements confirm the value of DARE-bench as both an accurate evaluation benchmark and a source of critical training data.
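To illustrate what scoring against verifiable ground truth can look like, and how a relative gain such as "1.83x" is computed, here is a minimal Python sketch. It is not DARE-bench's actual scoring code: the abstract does not describe the metric implementation, so the function names `exact_match_accuracy` and `relative_improvement`, the exact-match criterion, and the toy labels are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of predictions equal to the reference label at the same index.

    Hypothetical helper: the scoring is deterministic, so no human or
    model-based judge is involved and repeated runs give the same result.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(ground_truth)


def relative_improvement(baseline: float, improved: float) -> float:
    """Ratio of the improved score to the baseline score (2.0 means a 2x gain)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return improved / baseline


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not benchmark data).
    reference = ["cat", "dog", "dog", "cat"]
    before = ["cat", "cat", "cat", "dog"]   # 1 of 4 correct
    after = ["cat", "dog", "cat", "dog"]    # 2 of 4 correct

    acc_before = exact_match_accuracy(before, reference)
    acc_after = exact_match_accuracy(after, reference)
    print(acc_before, acc_after, relative_improvement(acc_before, acc_after))
```

Because the score depends only on the submitted predictions and the stored reference labels, two independent runs of the same submission yield identical numbers, which is the reproducibility property the abstract attributes to ground-truth evaluation.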

0 Citations

0 Influential

9 Altmetric

45.0 Score
