2603.26499v1 Mar 27, 2026 cs.AI

AIRA_2: Overcoming Bottlenecks in AI Research Agents

A. Lupidi (Citations: 97; h-index: 5)
Bassel Al Omari (Citations: 66; h-index: 4)
Despoina Magka (Citations: 488; h-index: 10)
Alexis Audran-Reiss (Citations: 55; h-index: 4)
Jean-Christophe Gagnon-Audet (Citations: 298; h-index: 7)
Derek Dunfield (Citations: 52; h-index: 3)
Martin Josifoski (Citations: 1,164; h-index: 12)
Ishita Mediratta (Citations: 840; h-index: 8)
Kelvin Niu (Citations: 252; h-index: 5)
Parth Pathak (Citations: 23; h-index: 3)
Michael Shvartsman (Citations: 74; h-index: 5)
Edan Toledo (Citations: 69; h-index: 4)
Anton Protopopov (Citations: 16; h-index: 2)
Jakob Foerster (Citations: 126; h-index: 4)
Yoram Bachrach (Citations: 6,399; h-index: 42)
Tatiana Shavrina (Citations: 264; h-index: 8)
Karen Hambardzumyan, YerevaNN (Citations: 1,286; h-index: 9)
N. Baldwin (Citations: 54; h-index: 2)
Rishi Hazra, Doctoral student, AASS Research Centre, Örebro University (WASP graduate school) (Citations: 211; h-index: 8)
Michael Kuchnik (Citations: 126; h-index: 4)
Thom Foster (Citations: 12; h-index: 2)
Hela Momand (Citations: 0; h-index: 0)
Nicola Cancedda (Citations: 121; h-index: 4)
Pontus Stenetorp (Citations: 57; h-index: 3)
Carole-Jean Wu (Citations: 325; h-index: 7)

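Prior work has identified three structural bottlenecks in AI research agents: (1) synchronous single-GPU execution limits sample throughput, which reduces the benefit of search; (2) validation-based selection creates a generalization gap that degrades performance as the search horizon extends; and (3) the limited capability of fixed, single-turn LLM (Large Language Model) operators caps search performance. To address these bottlenecks, we propose AIRA$_2$, which introduces three architectural improvements: an asynchronous multi-GPU worker pool that increases experiment throughput linearly, a Hidden Consistent Evaluation protocol that provides a reliable evaluation signal, and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$_2$ achieves a mean percentile rank of 71.8% at 24 hours, surpassing the previous best of 69.9%, and steadily improves to 76.0% at 72 hours. The analysis shows that each component is essential and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.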

Original Abstract

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap where validation-based selection causes performance to degrade over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators imposes a ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$_2$ achieves a mean Percentile Rank of 71.8% at 24 hours - surpassing the previous best of 69.9% - and steadily improves to 76.0% at 72 hours. Ablation studies reveal that each component is necessary and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.
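The abstract's claim that apparent "overfitting" can come from evaluation noise rather than memorization can be illustrated with a small simulation. This is a minimal sketch and not the paper's experiment. The function name `selection_gap`, the Gaussian score model, and every parameter value are assumptions made here for illustration. Each candidate has a fixed true score, the validation score adds independent noise, and the search keeps whichever candidate has the best validation score.

```python
import random
import statistics


def selection_gap(n_candidates, noise_std, trials=500, seed=0):
    """Average (validation score - true score) of the candidate
    selected by its noisy validation score.

    Hypothetical model: true scores ~ N(0, 1), and
    validation score = true score + N(0, noise_std) noise.
    Nothing is trained, so no memorization of validation data can occur.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        best_val, best_true = float("-inf"), 0.0
        for _ in range(n_candidates):
            true_score = rng.gauss(0.0, 1.0)
            val_score = true_score + rng.gauss(0.0, noise_std)
            # Validation-based selection: keep the highest validation score.
            if val_score > best_val:
                best_val, best_true = val_score, true_score
        gaps.append(best_val - best_true)
    return statistics.fmean(gaps)


if __name__ == "__main__":
    # Compare a noiseless evaluator with a noisy one at several search sizes.
    for n in (1, 10, 100):
        print(n, selection_gap(n, noise_std=0.0), selection_gap(n, noise_std=1.0))
```

Under these assumptions the gap between the selected validation score and the true score grows with the number of candidates whenever the evaluator is noisy, and it is exactly zero for a noiseless evaluator. The gap is therefore a selection effect of picking the maximum of noisy measurements, which is one way a widening validation-to-test gap over longer searches could arise without any data memorization.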

0 Citations

0 Influential

21 Altmetric

105.0 Score
