2602.02320v2 Feb 02, 2026 cs.CL

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai (citations: 32; h-index: 3)
G. He (citations: 439; h-index: 7)
Yi Hu (citations: 13; h-index: 2)
Jingjing Wang (citations: 17; h-index: 3)
Joshua Luo (citations: 45; h-index: 3)
Tianyu Zhu (citations: 22; h-index: 2)
Srikanth Pilla (citations: 1,673; h-index: 23)
Gang Li (citations: 30; h-index: 3)
Ling Liu (citations: 63; h-index: 4)
Feng Luo (citations: 10; h-index: 2)

Abstract

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.
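To make the metadata-guided step more concrete, here is a minimal Python sketch of how structured XML metadata might be turned into a prompt that constrains an LLM to stated structural facts. Everything specific in it is an assumption for illustration: the XML element and attribute names (molecule, parent, substituent, suffix, locant, carbons), the helper names metadata_to_facts and build_prompt, and the prompt wording are hypothetical and are not taken from the paper, whose actual schema and parser are not described in the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata for 2-chloroethanol. The element and attribute
# names are illustrative assumptions, not the paper's actual schema.
EXAMPLE_XML = """
<molecule name="2-chloroethanol">
  <parent name="ethane" carbons="2"/>
  <substituent name="chloro" locant="2"/>
  <suffix name="ol" locant="1"/>
</molecule>
"""


def metadata_to_facts(xml_text):
    """Flatten the (assumed) XML metadata into short structural facts."""
    root = ET.fromstring(xml_text.strip())
    facts = []
    parent = root.find("parent")
    if parent is not None:
        facts.append(
            f"parent chain {parent.get('name')} with {parent.get('carbons')} carbons"
        )
    for sub in root.findall("substituent"):
        facts.append(f"substituent {sub.get('name')} at position {sub.get('locant')}")
    for suf in root.findall("suffix"):
        facts.append(f"suffix -{suf.get('name')} at position {suf.get('locant')}")
    return facts


def build_prompt(xml_text):
    """Build a text prompt that restricts the description to the listed facts."""
    root = ET.fromstring(xml_text.strip())
    lines = [f"Describe the structure of {root.get('name')} using only these facts:"]
    lines += [f"- {fact}" for fact in metadata_to_facts(xml_text)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(EXAMPLE_XML))
```

The sketch stops at prompt construction on purpose: the call to an LLM and the validation step are left out because the abstract does not specify which models or criteria were used.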

Citations: 0; Influential: 0; Altmetric: 11.5; Score: 57.5
