2604.09836v1 Apr 10, 2026 cs.AI

COMPOSITE-STEM: A Composite Knowledge-Based Problem-Solving Benchmark

COMPOSITE-Stem

Yuqi Li (Citations: 3,605; h-index: 4)
Kyle Waters (Citations: 4; h-index: 1)
Lucas Nuzzi (Citations: 12; h-index: 2)
Tadhg Looram (Citations: 4; h-index: 1)
A. Tomasiello (Citations: 6,514; h-index: 45)
A. Kamdoum (Citations: 342; h-index: 1)
Bikun Li (Citations: 345; h-index: 2)
Damien Sileo (Citations: 7; h-index: 2)
E. Kretov (Citations: 1; h-index: 1)
Francesco Fournier-Facio (Citations: 350; h-index: 2)
Georgios Soloupis (Citations: 1; h-index: 1)
Haile Kassahun (Citations: 1; h-index: 1)
Hew Wolff (Citations: 342; h-index: 1)
Marc Roth (Citations: 345; h-index: 2)
M. Naiya (Citations: 357; h-index: 3)
N. Guo (Citations: 150; h-index: 5)
Richard G. Wheeler (Citations: 1; h-index: 1)
Samuele Sala (Citations: 20; h-index: 3)
S. Popov (Citations: 3; h-index: 1)
Jiaqi Cai (Citations: 347; h-index: 2)
Liang Li (Citations: 24; h-index: 2)
Qi-Dong Tang (Citations: 7; h-index: 1)
S. Dillmann (Citations: 2; h-index: 1)

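AI agents have great potential to accelerate scientific discovery, but a lack of frontier evaluations still hinders their adoption into real workflows. Expert-written benchmarks have been effective at measuring AI reasoning, but most are now saturated and measure performance only on constrained outputs. To close this gap, we introduce COMPOSITE-STEM, a benchmark of 70 tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. COMPOSITE-STEM combines exact-match grading with criterion-based rubrics and uses an LLM-as-a-jury protocol, enabling flexible assessment of a range of scientifically meaningful outputs. We evaluated four frontier models using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework. The best-performing model achieved 21%, showing that COMPOSITE-STEM measures capabilities beyond the reach of current agents. All tasks are released with contributor permission to support reproducibility and to encourage further research on how AI can accelerate scientific progress in these fields.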

Original Abstract

AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI's acceleration of scientific progress in these domains.
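
The abstract names two grading modes, exact match and criterion-based rubrics judged by an LLM jury, without giving implementation details. The following is only a minimal sketch of how such a combination could look, not the authors' protocol. The majority-vote rule, the equal weighting of criteria, the string normalization, and all function names are assumptions made for illustration, and the jurors here are plain Python callables standing in for LLM calls.

```python
from typing import Callable, Sequence

# Hypothetical juror type: takes (answer, criterion) and returns a pass/fail verdict.
# In a real setup each juror would wrap a call to a different LLM.
Juror = Callable[[str, str], bool]


def exact_match(answer: str, reference: str) -> float:
    """Return 1.0 if the answers agree after simple normalization, else 0.0."""
    def normalize(text: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(text.strip().lower().split())
    return 1.0 if normalize(answer) == normalize(reference) else 0.0


def jury_verdict(answer: str, criterion: str, jurors: Sequence[Juror]) -> bool:
    """A criterion counts as met when a strict majority of jurors say so (assumed rule)."""
    votes = [juror(answer, criterion) for juror in jurors]
    return 2 * sum(votes) > len(votes)


def rubric_score(answer: str, criteria: Sequence[str], jurors: Sequence[Juror]) -> float:
    """Fraction of rubric criteria the jury judges as met (equal weights assumed)."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    met = sum(jury_verdict(answer, criterion, jurors) for criterion in criteria)
    return met / len(criteria)


# Stand-in jurors for demonstration only; they do simple substring checks.
def keyword_juror(answer: str, criterion: str) -> bool:
    return criterion.lower() in answer.lower()


def strict_juror(answer: str, criterion: str) -> bool:
    return criterion in answer


def lenient_juror(answer: str, criterion: str) -> bool:
    return True
```

With the three stand-in jurors, an answer such as "The ground state energy is -13.6 eV" graded against the criteria "-13.6 eV", "Ground state", and "binding" meets two of three criteria by majority vote, so `rubric_score` returns 2/3. A task with a single well-defined answer would use `exact_match` instead.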

1 Citation

0 Influential

22.5 Altmetric

113.5 Score
