2605.26441v1 May 26, 2026 cs.CV

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Jianfeng Dong

Citations: 151

h-index: 6

Wanlong Fang

Citations: 380

h-index: 14

Xiang Fang

Citations: 852

h-index: 17

Zeyu Xiong

Citations: 112

h-index: 3

Xiaoye Qu

Citations: 2,121

h-index: 31

Chen Chen

Citations: 69

h-index: 2

Keke Tang

Citations: 339

h-index: 13

Pan Zhou

Citations: 172

h-index: 7

Yu Cheng

Citations: 3,095

h-index: 14

Daizong Liu

Citations: 2,351

h-index: 31

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches generally follow a moment-proposal selection framework that uses contrastive learning and reconstruction paradigms to score pre-defined moment proposals. Although they have achieved significant progress, we argue that these frameworks overlook two important issues: 1) Coarse-grained cross-modal learning: previous methods capture only the global video-level alignment with the query and fail to model the fine-grained consistency between video frames and query words that is needed to ground moment boundaries accurately. 2) Complex moment proposals: performance relies heavily on proposal quality, and generating and selecting proposals is time-consuming and complicated. To this end, we make the first attempt to tackle this task from a game perspective, learning the uncertain relationship between each vision-language pair at diverse granularities and in flexible combinations for multi-level cross-modal interaction. Specifically, we model each video frame and query word as a player in a multivariate cooperative game and learn their contributions to the cross-modal similarity score. By quantifying the tendency of frames and words to cooperate within a coalition via game-theoretic interaction, we can value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we use the learned query-guided frame-wise scores for moment localization. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Captions datasets.
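To make the idea of frame-word cooperation within a coalition concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not say which interaction index or coalition value is used, so the Banzhaf interaction index, the two toy value functions, and all names below (`banzhaf_interaction`, `additive_value`, `maxpool_value`, `frame_scores`) are assumptions chosen for illustration. Exhaustive enumeration is only feasible for a handful of players.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Banzhaf interaction index between players i and j.

    Averages v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S) over every
    coalition S that contains neither i nor j.
    """
    others = [p for p in players if p != i and p != j]
    total = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = frozenset(subset)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    return total / (2 ** len(others))


def _split(coalition):
    """Separate a coalition of ("f", idx) / ("w", idx) players into index lists."""
    frames = [idx for kind, idx in coalition if kind == "f"]
    words = [idx for kind, idx in coalition if kind == "w"]
    return frames, words


def additive_value(sim):
    """Toy value (assumption): sum of sim[f][w] over all frame-word pairs in the coalition."""
    def value(coalition):
        frames, words = _split(coalition)
        return sum(sim[f][w] for f in frames for w in words)
    return value


def maxpool_value(sim):
    """Toy value (assumption): each word contributes its best-matching frame in the coalition."""
    def value(coalition):
        frames, words = _split(coalition)
        if not frames or not words:
            return 0.0
        return sum(max(sim[f][w] for f in frames) for w in words)
    return value


def frame_scores(sim, make_value=additive_value):
    """Query-guided frame-wise scores: each frame's interaction summed over all words."""
    n_frames, n_words = len(sim), len(sim[0])
    players = [("f", f) for f in range(n_frames)] + [("w", w) for w in range(n_words)]
    value = make_value(sim)
    return [
        sum(banzhaf_interaction(players, value, ("f", f), ("w", w)) for w in range(n_words))
        for f in range(n_frames)
    ]


# Hypothetical frame-word similarities: 3 frames (rows) x 2 query words (columns).
sim = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
scores = frame_scores(sim)                               # additive toy game
scores_maxpool = frame_scores(sim, make_value=maxpool_value)  # non-additive toy game
```

With the additive value, the interaction between a frame and a word reduces to their pairwise similarity, and frame-frame interactions vanish, which makes it a useful sanity check. The max-pooling value is non-additive, so a frame's interaction with a word then depends on which other frames are already in the coalition. Frames whose scores stay high over a contiguous span would then be candidates for the grounded moment.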

