2603.22121v1 Mar 23, 2026 cs.CV

Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

Nan Wu (Citations: 10, h-index: 2)
Xinyue Liu (Citations: 32, h-index: 4)
Linlin Zong (Citations: 1,206, h-index: 15)
Yunzhuo Sun (Citations: 129, h-index: 6)
Yanyang Li (Citations: 7, h-index: 1)
Yifan Xu (Citations: 36, h-index: 3)
Xianchao Zhang (Citations: 394, h-index: 10)
Wenxin Liang (Citations: 22, h-index: 3)

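Text-based video moment retrieval (VMR) does not adequately capture the temporal dynamics inherent in untrimmed videos, which leads to imprecise localization in long sequences. Existing methods rely on natural language queries (NLQs) or static image augmentation, ignore motion sequences, and incur high computational cost in Transformer-based architectures. Because existing approaches do not effectively integrate subtitle context with generated temporal information, this work proposes a new two-stage framework for improved temporal grounding. In the first stage, LLM-based subtitle matching identifies relevant textual cues in the video subtitles, which are combined with the query to generate auxiliary short videos with a text-to-video model, so that implicit motion information serves as a temporal prior. In the second stage, the augmented query is processed by a multimodal controlled Mamba network, which extends text-based selection with video-based gating to fuse the generated priors with long sequences efficiently and filter out noise. The framework is independent of the base retrieval model and can be applied broadly to multimodal VMR. Experiments on the TVR benchmark show substantial gains over state-of-the-art methods, with lower computational cost and higher recall on long sequences.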

Original Abstract

Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively; we therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, which are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multimodal controlled Mamba network, which extends text-controlled selection with video-guided gating to fuse generated priors and long sequences efficiently while filtering noise. Our framework is agnostic to base retrieval models and widely applicable to multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
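
The second-stage idea of combining text-controlled selection with video-guided gating can be illustrated with a toy gated recurrence. This is only a sketch under stated assumptions, not the authors' architecture: the function name `gated_scan`, the two-dimensional feature vectors, and the choice of multiplying a text gate by a video gate are all hypothetical, and a real Mamba block uses learned, input-dependent state-space parameters rather than raw dot products.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_scan(frames, text_query, video_prior):
    """Scan clip features once, in linear time, with an input-dependent gate.

    frames:      list of clip feature vectors from the long video
    text_query:  embedding of the (subtitle-augmented) text query
    video_prior: pooled embedding of the generated auxiliary video
    All three inputs are hypothetical stand-ins for learned features.
    """
    state = [0.0] * len(text_query)
    states, gates = [], []
    for frame in frames:
        # Assumed gating rule: a clip is written into the state only if it
        # agrees with both the text query and the generated video prior.
        text_gate = sigmoid(dot(frame, text_query))
        video_gate = sigmoid(dot(frame, video_prior))
        gate = text_gate * video_gate
        # Selective update: keep the old state when the gate is low,
        # overwrite it with the current clip when the gate is high.
        state = [(1.0 - gate) * s + gate * f for s, f in zip(state, frame)]
        states.append(state)
        gates.append(gate)
    return states, gates


# Three toy clips: aligned with the query, orthogonal to it, and opposed to it.
frames = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
text_query = [1.0, 0.0]
video_prior = [1.0, 0.0]

states, gates = gated_scan(frames, text_query, video_prior)
best_clip = max(range(len(gates)), key=gates.__getitem__)
```

In this toy setting the gate acts as a noise filter: clips that disagree with either the text or the generated prior barely change the running state, and the whole video is processed in a single pass rather than with pairwise attention.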

0 Citations

0 Influential

7.5 Altmetric

37.5 Score
