2602.09163v1 Feb 09, 2026 cs.AI

FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

Xingjian Zhang

Sophia Moylan

Ziyang Xiong

Qiaozhu Mei

Yichen Luo

Jiaqi W. Ma

University of Illinois Urbana-Champaign

Abstract

Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations. Existing benchmarks focus on isolated subtasks such as named entity recognition or relation extraction and do not capture this end-to-end workflow. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
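To make the task format concrete, here is a minimal sketch of the input/output contract the abstract describes: an agent receives a gene symbol and returns Gene Ontology terms, expression patterns, and synonyms, which are then compared against curated FlyBase annotations. The `GeneAnnotations` record, its field names, and the exact-match set F1 used in `score` are hypothetical illustrations, not FlyBench's published schema or scoring method; a real evaluation might, for instance, give partial credit for related terms in the ontology hierarchy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GeneAnnotations:
    """Hypothetical per-gene output record; field names are illustrative only."""
    gene_symbol: str
    go_terms: set[str] = field(default_factory=set)             # function annotations (GO IDs)
    expression_patterns: set[str] = field(default_factory=set)  # where/when the gene is expressed
    synonyms: set[str] = field(default_factory=set)             # historical names for the gene


def set_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 under exact string matching (an assumed metric)."""
    if not predicted and not gold:
        return 1.0, 1.0, 1.0  # nothing to find and nothing predicted: treat as perfect
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def score(pred: GeneAnnotations, gold: GeneAnnotations) -> dict[str, float]:
    """Per-category F1 for one gene, comparing agent output to curated annotations."""
    if pred.gene_symbol != gold.gene_symbol:
        raise ValueError("predicted and gold records refer to different genes")
    return {
        "go_terms": set_f1(pred.go_terms, gold.go_terms)[2],
        "expression_patterns": set_f1(pred.expression_patterns, gold.expression_patterns)[2],
        "synonyms": set_f1(pred.synonyms, gold.synonyms)[2],
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real FlyBase or GO data.
    pred = GeneAnnotations("geneX", go_terms={"GO:A", "GO:B"}, synonyms={"old1"})
    gold = GeneAnnotations("geneX", go_terms={"GO:A", "GO:C"}, synonyms={"old1", "old2"})
    print(score(pred, gold))
    # → {'go_terms': 0.5, 'expression_patterns': 1.0, 'synonyms': 0.6666666666666666}
```

Treating two empty sets as a perfect match is one design choice; an evaluation could instead skip categories that have no curated annotations for a given gene.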
