2605.06177v1 May 07, 2026 cs.AI

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

David A. Clifton (Citations: 2; h-index: 1)
Junde Wu (Citations: 934; h-index: 9)
Jiazhen Pan (Citations: 564; h-index: 13)
Jinge Wu (Citations: 125; h-index: 4)
Min Zeng (Citations: 4; h-index: 1)
Jiayuan Zhu (Citations: 899; h-index: 10)
Sean Wu (Citations: 155; h-index: 4)
Honghan Wu (Citations: 502; h-index: 10)
Hongjian Zhou (Citations: 416; h-index: 6)
Fenglin Liu (Citations: 42; h-index: 4)

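Building a deep research agent today amounts to wiring many components together. Even when the same backbone model is evaluated on the same benchmark, reported accuracy can differ because the harness and the tool registry differ, and integrating a new foundation model into a comparable evaluation setup can take weeks of model-specific work. We call this the "per-paper engineering tax" and release BioMedArena, an open-source toolkit that both reduces this cost and provides an arena for fairly comparing foundation models evaluated as deep research agents. BioMedArena decouples six layers of biomedical agent evaluation (benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring) and provides 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool takes only a few lines of provider adapter registration. The toolkit also provides 6 agent harnesses with 6 context-management strategies, which give 12 backbone models competitive research capabilities and substantially improved performance, reaching state-of-the-art (SOTA) results on 8 representative biomedical benchmarks. The toolkit, configuration files, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena.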

Original Abstract

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena

0 Citations

0 Influential

40.951858789481 Altmetric

204.8 Score
