2602.06855v3 Feb 06, 2026 cs.AI

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

R. Raileanu (Citations: 25,277; h-index: 30)
A. Budhiraja (Citations: 144; h-index: 5)
A. Lupidi (Citations: 97; h-index: 5)
Bhavul Gauri (Citations: 28; h-index: 3)
Thomas Foster (Citations: 108; h-index: 4)
Bassel Al Omari (Citations: 66; h-index: 4)
Despoina Magka (Citations: 488; h-index: 10)
Alexis Audran-Reiss (Citations: 55; h-index: 4)
Muna Aghamelu (Citations: 11; h-index: 1)
Jean-Christophe Gagnon-Audet (Citations: 298; h-index: 7)
C. Leow (Citations: 964; h-index: 16)
Sandra Lefdal (Citations: 3,341; h-index: 3)
Hossam Mossalam (Citations: 195; h-index: 2)
A. Moudgil (Citations: 297; h-index: 6)
S. Nazir (Citations: 107; h-index: 5)
Emanuel Tewolde (Citations: 17; h-index: 2)
Isabel Urrego (Citations: 11; h-index: 1)
J. Estapé (Citations: 112; h-index: 4)
Gaurav Chaurasia (Citations: 107; h-index: 4)
Abhishek Charnalia (Citations: 114; h-index: 5)
Derek Dunfield (Citations: 52; h-index: 3)
K. Hambardzumyan (Citations: 11; h-index: 1)
Daniel Izcovich (Citations: 20; h-index: 2)
Martin Josifoski (Citations: 1,164; h-index: 12)
Ishita Mediratta (Citations: 840; h-index: 8)
Kelvin Niu (Citations: 252; h-index: 5)
Parth Pathak (Citations: 23; h-index: 3)
Michael Shvartsman (Citations: 74; h-index: 5)
Edan Toledo (Citations: 69; h-index: 4)
Anton Protopopov (Citations: 16; h-index: 2)
Alexander H. Miller (Citations: 60; h-index: 4)
T. Shavrina (Citations: 11; h-index: 1)
Jakob Foerster (Citations: 126; h-index: 4)
Yoram Bachrach (Citations: 6,399; h-index: 42)
A. Pepe (Citations: 2,708; h-index: 18)
Nicola Baldwin (Citations: 25; h-index: 2)
Lucia Cipolina-Kun (Citations: 118; h-index: 5)

Abstract

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.

11 Citations

1 Influential

21 Altmetric

118.0 Score
