2606.09169v1 Jun 08, 2026 cs.AI

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Weitong Lian

Zecong Tang

L. Meng

Tengju Ru

Zhejun Cui

Qi Kang

Yu Zhang

Kaixuan Wang

Yechi Liu

Haoran Li

Hang Cao

Yichen Zhu

Yutao Yuan

Chunwei Wang

Bo Dai

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
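
Of the test-time scaling strategies named above, Best-of-N Sampling is the simplest to illustrate. The following is a minimal sketch of the general technique, not the paper's implementation: the abstract does not say how candidates are generated or scored, so `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer merely stand in for a model's sampling call and a verifier.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[str, random.Random], T],
    score: Callable[[str, T], float],
    prompt: str,
    n: int = 4,
    seed: int = 0,
) -> Tuple[T, float]:
    """Draw n candidates for the prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # Sample n independent candidates from the (hypothetical) generator.
    candidates = [generate(prompt, rng) for _ in range(n)]
    # Score each candidate with the (hypothetical) verifier or reward function.
    scores = [score(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


# Toy stand-ins (assumptions): a "model" that guesses an integer and a
# "verifier" that prefers guesses close to 50.
def toy_generate(prompt: str, rng: random.Random) -> int:
    return rng.randint(0, 100)


def toy_score(prompt: str, candidate: int) -> float:
    return -abs(candidate - 50)


if __name__ == "__main__":
    for n in (1, 4, 16):
        candidate, value = best_of_n(toy_generate, toy_score, "demo", n=n, seed=3)
        print(n, candidate, value)
```

With a fixed seed, the first candidates drawn are identical across values of `n`, so the selected score in this sketch can only stay the same or improve as `n` grows. In a real multi-turn setting the selection would presumably be applied at each generation turn, which is one plausible way such a strategy could limit error accumulation across turns.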
