2604.20806v1 Apr 22, 2026 cs.CV


OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

Yi Yang


Jingqi Tong


Qiguang Chen

SCIR


Libo Qin


Wanxiang Che


Chengyu Luan


Jiajun Wu


Qiming Yu


Yizhuo Li


Xiachong Feng



Original Abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
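
The abstract mentions two scoring modes, exact and semantic answer matching, without describing how either is implemented. The Python sketch below is only an illustration of how such a two-stage scorer might be organized, not the benchmark's actual protocol. The normalization rules, the function names (`normalize`, `exact_match`, `score`), and the pluggable `semantic_judge` callback are all assumptions made for this example.

```python
import re
import unicodedata
from typing import Callable, Optional, Sequence

# Hypothetical type: any callable that decides whether two answers mean the same thing
# (for example, a wrapper around an LLM judge). Not part of OMIBench itself.
SemanticJudge = Callable[[str, str], bool]


def normalize(answer: str) -> str:
    """Apply Unicode NFKC, lowercase, collapse whitespace, drop trailing punctuation."""
    text = unicodedata.normalize("NFKC", answer).strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(" .;,")


def exact_match(prediction: str, reference: str) -> bool:
    """Exact matching after the (assumed) normalization above."""
    return normalize(prediction) == normalize(reference)


def score(
    predictions: Sequence[str],
    references: Sequence[str],
    semantic_judge: Optional[SemanticJudge] = None,
) -> float:
    """Return accuracy; fall back to the semantic judge only when exact match fails."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        if exact_match(pred, ref):
            correct += 1
        elif semantic_judge is not None and semantic_judge(pred, ref):
            correct += 1
    return correct / len(references)
```

Under these assumptions, `score(["B", "1/2"], ["b", "0.5"])` counts only the first item as correct, while passing a judge that treats `1/2` and `0.5` as equivalent would count both.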

