2604.27977v1 Apr 30, 2026 cs.AI

D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

Yankai Yang (Citations: 22, h-index: 3)
Huan Sun (Citations: 52, h-index: 4)
Nesreen K. Ahmed (Citations: 168, h-index: 7)
Tianshu Zhang (Citations: 393, h-index: 4)
Ali Payani (Citations: 53, h-index: 4)
Cheng Tang (Citations: 177, h-index: 6)
Hanane Nour Moussa (Citations: 79, h-index: 4)
Yifei Li (Citations: 184, h-index: 2)
Zhuoyang Li (Citations: 1, h-index: 1)
Ziru Chen (Citations: 43, h-index: 2)

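Despite recent advances in language models and agents for scientific data-driven discovery, progress is limited by the absence of verifiable environments that represent real scientific work. To close this gap, we introduce D3-Gym, the first automatically constructed dataset for scientific data-driven discovery. D3-Gym (1) consists of 565 tasks drawn from 239 real scientific repositories, and (2) equips each task with a natural language instruction, an execution environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically generated evaluation script. A rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts reach 87.5% agreement with human-written gold standards and show high consistency in domain-specific evaluation logic. In addition, training on data sampled from D3-Gym yields consistent and substantial gains on the ScienceAgentBench benchmark across Qwen3 models of various sizes; in particular, Qwen3-32B improves by 7.8 points, greatly narrowing the gap with strong proprietary models. All D3-Gym components (environments, creation workflow, training data, and models) are available at https://github.com/OSU-NLP-Group/D3-Gym.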

Original Abstract

Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines, where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.
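
The abstract lists five components per task and reports an agreement rate between the synthesized evaluation scripts and human-annotated gold standards. The sketch below is only an illustration of those two ideas, not the D3-Gym schema or the authors' evaluation code: the `Task` class, its field names, and the `agreement_rate` helper are hypothetical, and the verdict lists are toy values invented for the example (not data from the paper).

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical record mirroring the five task components named in the abstract."""
    instruction: str          # natural language instruction
    environment: str          # identifier of an executable environment (assumed to be a string)
    dataset_preview: str      # preview of input dataset and artifacts
    reference_solution: str   # reference code solution
    evaluation_script: str    # automatically synthesized evaluation script


def agreement_rate(auto_verdicts, gold_verdicts):
    """Percentage of cases where the automated verdict equals the human gold verdict."""
    if len(auto_verdicts) != len(gold_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not auto_verdicts:
        raise ValueError("verdict lists must not be empty")
    matches = sum(a == g for a, g in zip(auto_verdicts, gold_verdicts))
    return 100.0 * matches / len(auto_verdicts)


# Toy pass/fail verdicts, made up for illustration: 7 of the 8 pairs agree.
auto = [True, True, False, True, False, True, True, False]
gold = [True, True, False, True, False, True, True, True]
print(agreement_rate(auto, gold))
```

Simple verdict matching is one plausible reading of "agreement"; the abstract does not say how the 87.5% figure is computed, so treat this as an assumption.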

0 Citations

0 Influential

32.45879734614 Altmetric

162.3 Score
