2606.09169v1 Jun 08, 2026 cs.AI

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Weitong Lian, Zecong Tang, L. Meng, Tengju Ru, Zhejun Cui, Qi Kang, Yu Zhang, Kaixuan Wang, Yechi Liu, Haoran Li, Hang Cao, Yichen Zhu, Yutao Yuan, Chunwei Wang, Bo Dai

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
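
As an illustration of the Best-of-N Sampling strategy named in the abstract, the following is a minimal sketch and not the authors' implementation. The names `best_of_n`, `generate_candidate`, and `score_candidate` are hypothetical: the generator stands in for a UMM's generation call and the scorer for whatever verifier ranks the candidates, and the numeric "outputs" exist only so the example runs on its own.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[random.Random], T],
    score: Callable[[T], float],
    n: int,
    seed: int = 0,
) -> Tuple[T, float]:
    """Draw n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # Sample n independent candidates from the (hypothetical) generator.
    candidates: List[T] = [generate(rng) for _ in range(n)]
    # Score each candidate and keep the best one.
    scored = [(c, score(c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stand-ins (assumptions, not part of IMUG-Bench):
def generate_candidate(rng: random.Random) -> float:
    # Toy "generation": a noisy number standing in for a generated image.
    return rng.gauss(0.0, 1.0)


def score_candidate(candidate: float) -> float:
    # Toy "verifier": candidates closer to the target value 1.0 score higher.
    return -abs(candidate - 1.0)


if __name__ == "__main__":
    best, best_score = best_of_n(generate_candidate, score_candidate, n=8)
    print(best, best_score)
```

In this sketch, with a fixed seed, raising `n` can only keep or improve the selected score, because the earlier candidates are drawn identically; the cost is `n` generation calls per turn.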
