2602.05847v2 Feb 05, 2026 cs.AI

OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Yihao Hu (Citations: 6, h-index: 1)
Zhangquan Chen (Citations: 144, h-index: 7)
Ruihuang Li (Citations: 48, h-index: 4)
Zhantao Yang (Citations: 130, h-index: 4)
Xinlei Yu (Citations: 235, h-index: 7)
Jiale Tao (Citations: 138, h-index: 5)
Haodong Jing (Citations: 102, h-index: 6)
Manyuan Zhang (Citations: 25, h-index: 3)
Shuai Shao (Citations: 10, h-index: 2)
Biao Wang (Citations: 244, h-index: 8)
Qinglin Lu (Citations: 30, h-index: 3)
Ruitao Chen (Citations: 8, h-index: 1)
Ruqi Huang (Citations: 154, h-index: 9)

Original Abstract

While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.

5 Citations

0 Influential

4.5 Altmetric

27.5 Score
