2603.25253v1 Mar 26, 2026 cs.CL

MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Yuhao Zhou (citations: 63, h-index: 6)
Shuang Wu (citations: 14, h-index: 2)
Jinghan Wang (citations: 12, h-index: 2)
Renquan Lv (citations: 6, h-index: 2)
Bing Zhao (citations: 7, h-index: 1)
Wei Hu (citations: 6, h-index: 2)
Tao Han (citations: 3, h-index: 1)

Original Abstract

Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn question answering (QA) formats, which are inadequate for measuring model performance on complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while most other models remain below 30%. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight a critical gap in current LLMs' strategic scientific reasoning and set a clear direction for future research toward AI that can actively participate in the scientific process.
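The multi-turn setup described in the abstract (an agent requests experiments, reads the results, and eventually commits to a structure) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Episode`, `run_episode`, `accuracy`, and `scripted_agent`, the turn budget, the toy spectra strings, and the exact-match scoring are hypothetical and are not taken from the MolQuest paper or its code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical; none come from the MolQuest paper.


@dataclass
class Episode:
    """One elucidation task: a hidden answer plus experiments the agent may request."""
    target: str                  # ground-truth structure, e.g. a SMILES string
    spectra: Dict[str, str]      # experiment name -> textual result (toy values)
    max_turns: int = 5           # assumed interaction budget
    log: List[Tuple[str, str]] = field(default_factory=list)


# An agent maps the interaction history to an action:
# ("request", experiment_name) or ("answer", proposed_structure).
Agent = Callable[[List[Tuple[str, str]]], Tuple[str, str]]


def run_episode(agent: Agent, ep: Episode) -> bool:
    """Run the multi-turn loop; return True if the submitted answer matches the target."""
    for _ in range(ep.max_turns):
        kind, arg = agent(ep.log)
        if kind == "answer":
            # Exact string match is a simplification; a real harness would
            # likely canonicalize structures before comparing them.
            return arg == ep.target
        # Any other action is treated as an experiment request.
        result = ep.spectra.get(arg, "experiment not available")
        ep.log.append((arg, result))
    return False  # turn budget exhausted without an answer


def accuracy(agent: Agent, episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent's answer was correct."""
    if not episodes:
        return 0.0
    return sum(run_episode(agent, ep) for ep in episodes) / len(episodes)


def scripted_agent(history: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Toy policy: request MS, then 1H NMR, then always guess ethanol."""
    seen = {name for name, _ in history}
    for experiment in ("MS", "1H NMR"):
        if experiment not in seen:
            return ("request", experiment)
    return ("answer", "CCO")


# Two toy tasks with placeholder spectra (not real benchmark data).
episodes = [
    Episode(target="CCO", spectra={"MS": "m/z 46", "1H NMR": "placeholder"}),
    Episode(target="CC(=O)O", spectra={"MS": "m/z 60", "1H NMR": "placeholder"}),
]
score = accuracy(scripted_agent, episodes)
```

In this sketch an LLM-backed agent would replace `scripted_agent`, reading the history and deciding which experiment to request next or when to answer. The contrast with a single-turn QA benchmark is that the score depends on the sequence of requests as well as on the final guess.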
