2604.06685v1 Apr 08, 2026 cs.CL

ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

Xuanle Zhao (Citations: 22, h-index: 3)
Xinyu Cai (Citations: 677, h-index: 12)
Xiang Cheng (Citations: 12, h-index: 2)
Xiuyi Chen (Citations: 287, h-index: 3)
Bo Xu (Citations: 12, h-index: 3)

Original Abstract

While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in "black-box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systematically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.
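
The abstract describes samples in which chemical descriptors (such as functional groups) are identified before the answer, and a filtering pipeline that curates them. It does not specify the record format or the filter criteria, so the following is only a minimal sketch under assumptions: the `ReasoningSample` fields, the `keep_sample` rule (at least one descriptor, and an exact match with a reference answer), and the example data are all hypothetical and are not taken from the ChemVLR code.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningSample:
    """Hypothetical record: descriptors are listed before the final answer."""
    image_id: str
    descriptors: list[str] = field(default_factory=list)  # e.g. functional groups
    reasoning: str = ""
    answer: str = ""


def keep_sample(sample: ReasoningSample, reference_answer: str) -> bool:
    """Assumed filter rule (not from the paper): keep a sample only if it
    names at least one descriptor and its final answer equals the reference
    after trimming surrounding whitespace."""
    has_descriptors = len(sample.descriptors) > 0
    answer_matches = sample.answer.strip() == reference_answer.strip()
    return has_descriptors and answer_matches


def filter_samples(samples, references):
    """Return samples that pass keep_sample; references maps image_id -> answer."""
    return [
        s for s in samples
        if s.image_id in references and keep_sample(s, references[s.image_id])
    ]


# Illustrative data only: the answers are SMILES strings for ethanol (CCO)
# and dimethyl ether (COC).
samples = [
    ReasoningSample("img-1", ["hydroxyl group"], "An O-H on an ethyl chain.", "CCO"),
    ReasoningSample("img-2", [], "No descriptors were identified.", "CCO"),
    ReasoningSample("img-3", ["ether"], "A C-O-C linkage.", "COC"),
]
references = {"img-1": "CCO", "img-2": "CCO", "img-3": "CCO"}

kept = filter_samples(samples, references)
```

Under these assumed rules only `img-1` is kept: `img-2` names no descriptor, and the answer for `img-3` does not match its reference. A real pipeline would likely compare canonicalized structures rather than raw strings.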
