2602.13758v1 Feb 14, 2026 cs.CV

OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding

Haoyi Tao (Citations: 15, h-index: 2)
Chaozheng Huang (Citations: 3, h-index: 1)
Nan Wang (Citations: 3, h-index: 1)
Han Lyu (Citations: 50, h-index: 2)
Linfeng Zhang (Citations: 979, h-index: 10)
Guolin Ke (Citations: 687, h-index: 12)
Xi Fang (Citations: 45, h-index: 5)

Original Abstract

Multimodal Large Language Models (MLLMs) demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images such as schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets, which have limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets and spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline. It leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and the corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and it raises the image-text multi-modal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, a Qwen2.5-VL-3B model fine-tuned on OmniScience shows substantial gains over baselines, improving by 0.378 on MM-MT-Bench and by 0.140 on MMMU.
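
To make the figure-caption-context triplet and the similarity-based quality filter more concrete, here is a minimal Python sketch. The abstract does not say how the image-text similarity score is computed, so the cosine similarity over embedding vectors, the `FigureTriplet` field names, and the 0.9 threshold are all assumptions chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class FigureTriplet:
    """One figure-caption-context record (field names are hypothetical)."""
    image_path: str
    caption: str   # original figure caption written by the paper's authors
    context: str   # in-text passages that reference the figure


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_filter(image_vec: Sequence[float],
                  text_vec: Sequence[float],
                  threshold: float = 0.9) -> bool:
    """Keep a re-captioned sample only if its image and text embeddings agree.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine_similarity(image_vec, text_vec) >= threshold
```

In practice the two vectors would come from an image encoder and a text encoder that share an embedding space. That choice is left open here because the source does not specify one.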
