2605.28035v1 May 27, 2026 cs.AI

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Heyan Huang
Xian-Ling Mao
Yue Liu
Haitian Li
Yanghao Zhou
Liang-Jin Chen
Yiming Cheng
Xu Liu
Dian Jin
Jiajun Xu
Jing Liao
Tian Lan
Ziqi Zhou
Yu Bai
Beijing Academy of Artificial Intelligence (BAAI)
Changsen Yuan
Jinxing Zhou
Xuefeng Chen
Yousheng Feng

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.
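
As a rough illustration of how an evaluator model might be scored on question-answering instances organized by the four taxonomy categories named above, the following sketch computes per-category accuracy and an interval overlap score for the temporal-localization subset. Everything here is an assumption made for illustration: the `QAInstance` record layout, the function names, and the use of temporal IoU as a localization measure are hypothetical and are not taken from the benchmark itself.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four top-level taxonomy categories named in the abstract.
CATEGORIES = ("acting", "narrative", "atmosphere", "audio-visual language")


@dataclass
class QAInstance:
    """Hypothetical record layout; the benchmark's real schema may differ."""
    category: str   # one of CATEGORIES
    gold: str       # reference answer label, e.g. "B"
    predicted: str  # answer label produced by the evaluator model


def accuracy_by_category(instances):
    """Return {category: fraction of instances answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        total[inst.category] += 1
        if inst.predicted == inst.gold:
            correct[inst.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds.

    Assumed metric for temporal localization; the paper's metric is not
    stated in the abstract.
    """
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    toy = [
        QAInstance("acting", gold="A", predicted="A"),
        QAInstance("acting", gold="B", predicted="C"),
        QAInstance("narrative", gold="D", predicted="D"),
    ]
    print(accuracy_by_category(toy))
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

Reporting accuracy per category rather than as a single pooled number keeps a weakness in one failure type, for example narrative, from being hidden by strength in another.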
