2605.28192v1 May 27, 2026 cs.AI

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Yu Wang
Yanfeng Wang
Ke Xu
Ziyang Cheng
Hongcheng Liu
Yuhao Wang

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs because the relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks examine this setting only partially, typically covering few modalities, few relevant temporal segments, or few reasoning steps. In this work, we introduce MOV-Bench, a benchmark of 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.


