2604.27389v1 Apr 30, 2026 cs.CV


COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Wei Wang (citations: 0; h-index: 0)

Lei Feng (citations: 20; h-index: 2)

Qipeng Guo (citations: 1,190; h-index: 8)

Zhishan Lin (citations: 181; h-index: 4)

Kai Chen (citations: 65; h-index: 3)

Lixin Gu (citations: 3,410; h-index: 6)

Huanze Tang (citations: 459; h-index: 4)

Haijun Lv (citations: 514; h-index: 6)


Original Abstract

In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts based on contextual evidence. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.

