2602.02905v1 Feb 02, 2026 cs.AI

FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Eric P. Xing (Citations: 353; h-index: 7)
Zhen Wang (Citations: 3; h-index: 1)
Jieyuan Liu (Citations: 4; h-index: 1)
Zhiting Hu (Citations: 30; h-index: 3)
Fan Bai, Johns Hopkins University (Citations: 355; h-index: 7)
Zhongyan Luo (Citations: 5; h-index: 1)
Jinyan Su (Citations: 30; h-index: 3)
Xinle Yu (Citations: 18; h-index: 2)
Kun Zhou (Citations: 50; h-index: 3)
C. Cardie (Citations: 3,034; h-index: 14)
M. Dredze (Citations: 3,361; h-index: 21)
Kaiser Sun, Johns Hopkins University (Citations: 212; h-index: 6)

Abstract

Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely heavily on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide only coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier LLM backbones such as gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.
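
The abstract reports rediscovery success as an F1 score but does not describe how agent conclusions are matched against the published findings. As a rough illustration only, the sketch below computes F1 on a 0-100 scale under the assumption that both the agent's conclusions and the reference findings have already been reduced to comparable identifiers. The function name `rediscovery_f1` and the `finding_*` labels are hypothetical and do not come from the paper.

```python
from typing import Set


def rediscovery_f1(recovered: Set[str], reference: Set[str]) -> float:
    """F1 on a 0-100 scale between agent findings and reference findings.

    Assumption: findings are pre-normalized to comparable identifiers.
    The actual matching procedure in FIRE-Bench is not given in the abstract.
    """
    if not recovered or not reference:
        return 0.0
    true_positives = len(recovered & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(recovered)  # share of agent claims that are correct
    recall = true_positives / len(reference)     # share of reference findings recovered
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical example: four reference findings, the agent states two claims,
# one of which matches a reference finding.
reference = {"finding_a", "finding_b", "finding_c", "finding_d"}
recovered = {"finding_a", "finding_x"}
score = rediscovery_f1(recovered, reference)
print(f"{score:.1f}")  # prints 33.3
```

In this toy case precision is 1/2 and recall is 1/4, so an agent that states one correct finding and one unsupported claim lands well under the 50 F1 ceiling quoted above.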

2 Citations

0 Influential

10.5 Altmetric

54.5 Score
