2601.19325v1 Jan 27, 2026 cs.CV

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

Zichen Wen, Haoyi Tao, Han Lyu, Guolin Ke, Xi Fang, Nang Yuan, Zhen Wang, Xiaoxing Wang, E. Weinan, Yanfeng Wang, Boxue Yang, Shuang Chen, Yaojie Zhang, Yuhang Han, Junlong Ke, Cong Wang, Yicheng Fu, Jiawang Zhao, Jiangchao Yao, H. Cai, Linli Yao, Zhifeng Gao, Yanhui Hong, Yixuan Li, Guojiang Zhao, Nan Wang, Ning Liao, Kai Chen, Zhiyu Li, Feiyu Xiong, Sihan Hu, Kun Chen, Linfeng Zhang

Abstract

We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific intelligence with substantially reduced data requirements. (i) First, we provide a fully transparent, end-to-end reproducible training pipeline, covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, along with detailed optimization recipes. This facilitates systematic extension by the community. (ii) Second, Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on various scientific tasks using fewer than five million curated samples without large-scale pretraining. These results highlight that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling. (iii) Third, Innovator-VL demonstrates strong generalization, achieving competitive performance on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Our practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built even without large-scale data, providing a practical foundation for future research.

