2604.21508v1 Apr 23, 2026 cs.AI

BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

Enhong Chen (Citations: 55; h-index: 5)
Xukai Liu (Citations: 88; h-index: 6)
Kai Zhang (Citations: 14; h-index: 3)
Yuhang Yang (Citations: 4; h-index: 1)
Jiaxian Yan (Citations: 139; h-index: 5)
Jintao Zhu (Citations: 82; h-index: 6)
Qi Liu (Citations: 6; h-index: 2)
Zaixin Zhang (Citations: 1,174; h-index: 12)
Boyan Zhang (Citations: 42; h-index: 3)
Kaiyuan Gao (Citations: 5; h-index: 2)
Jinchuan Xiao (Citations: 9; h-index: 2)

Original Abstract

Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation cannot keep pace with the rapidly growing body of publications. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., from Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm: multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner demonstrates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 on bioactivity triplets. BioMiner's practical utility is demonstrated through three applications: (1) extracting 82,262 data entries from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the amount of high-quality NLRP3 bioactivity data, contributing to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating bioactivity annotation of protein-ligand complexes, achieving a 5.59-fold speed-up and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
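
To make the reported triplet-level F1 score concrete, here is a minimal sketch of how such a metric could be computed. It assumes that a bioactivity triplet is a (protein, ligand, activity) tuple and that a prediction counts as correct only on an exact match; the abstract does not state the matching rule, so both are assumptions, and the function name `triplet_f1` and the toy values are purely illustrative.

```python
from typing import Iterable, Tuple

# Assumed layout of a bioactivity triplet: (protein, ligand SMILES, activity).
# The paper's actual schema and matching criteria may differ.
Triplet = Tuple[str, str, str]


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """Exact-match F1 between predicted and gold triplets (illustrative only)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        # Covers empty inputs and the case of no overlap at all.
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up toy data, not taken from the paper or from BioVista.
gold = [
    ("NLRP3", "CCO", "IC50=1.2 uM"),
    ("NLRP3", "CCN", "IC50=3.4 uM"),
    ("NLRP3", "CCC", "IC50=5.0 uM"),
]
pred = [
    ("NLRP3", "CCO", "IC50=1.2 uM"),  # matches a gold triplet exactly
    ("NLRP3", "CCN", "IC50=9.9 uM"),  # wrong activity value, so no match
]

score = triplet_f1(pred, gold)
print(f"triplet F1 = {score:.2f}")
```

Under exact matching, a single wrong element (protein, structure, or activity value) invalidates the whole triplet, which is one reason triplet-level scores tend to be much lower than per-field scores.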

3 Citations

1 Influential

6 Altmetric

35.0 Score
