2601.16344v1 Jan 22, 2026 cs.AI

DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

Yongchan Kwon (Citations: 70, h-index: 4)

Federico Bianchi (Citations: 556, h-index: 9)

Harper Hua (Citations: 22, h-index: 2)

Fan Nie (Citations: 52, h-index: 5)

Junlin Wang (Citations: 49, h-index: 4)

Zhenting Qi (Citations: 14, h-index: 2)

Owen Queen (Citations: 8, h-index: 2)

Shang Zhu (Citations: 8, h-index: 2)

James Zou (Citations: 10, h-index: 2)

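Data science agents have the potential to accelerate discovery and insight generation by turning data into executable analyses and findings. However, existing data science benchmarks suffer from fragmented evaluation interfaces that make cross-benchmark comparison difficult, from narrow task coverage, and from a lack of rigorous data grounding. In particular, the authors show that a substantial share of tasks in current benchmarks can be solved without using the actual data. To address these limitations, they introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym offers a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, so it can serve as a live, extensible testbed. The authors curate DSGym-Tasks, a comprehensive task suite that standardizes and refines existing benchmarks by filtering for quality and for shortcut solvability. They further expand task coverage with (1) DSBio, a set of expert-derived bioinformatics tasks grounded in the literature, and (2) DSPredict, a set of challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym supports agent training through an execution-verified data synthesis pipeline. As a case study, the authors build a 2,000-example training set within DSGym and train a 4B model that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.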

Original Abstract

Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage, and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut-solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in literature and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.
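
The abstract does not say how the shortcut-solvability filter is implemented. The idea it describes (dropping tasks that can be answered without the actual data) could be sketched roughly as follows. Everything here is a hypothetical illustration rather than DSGym's API: the `Task` record, the `Solver` signature, the exact-match comparison, and the `attempts` parameter are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not DSGym's actual schema."""
    task_id: str
    question: str
    data_path: Optional[str]
    answer: str


# A solver maps (question, data_path) to an answer string.
# Passing data_path=None means the data is withheld.
Solver = Callable[[str, Optional[str]], str]


def is_shortcut_solvable(task: Task, solver: Solver, attempts: int = 3) -> bool:
    """Return True if the solver answers correctly in any attempt without data."""
    expected = task.answer.strip().lower()
    for _ in range(attempts):
        prediction = solver(task.question, None)  # data withheld on purpose
        if prediction.strip().lower() == expected:
            return True
    return False


def filter_tasks(tasks: List[Task], solver: Solver, attempts: int = 3) -> List[Task]:
    """Keep only tasks that the solver cannot answer with the data withheld."""
    return [t for t in tasks if not is_shortcut_solvable(t, solver, attempts)]


def toy_solver(question: str, data_path: Optional[str]) -> str:
    """Stand-in for an agent: knows one fact by heart, otherwise needs the data."""
    if "capital of France" in question:
        return "Paris"
    return "42.0" if data_path is not None else "unknown"


tasks = [
    Task("t1", "What is the capital of France?", "geo.csv", "Paris"),
    Task("t2", "What is the mean of column x?", "x.csv", "42.0"),
]
kept = filter_tasks(tasks, toy_solver)
print([t.task_id for t in kept])
```

In this toy run, task `t1` is dropped because the stand-in solver answers it from memory, while `t2` is kept because a correct answer requires the data file. A real filter would likely need a more tolerant answer comparison than exact string matching.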

3 Citations

0 Influential

4.5 Altmetric

25.5 Score
