2601.02908v1 Jan 06, 2026 cs.CV

TA-Prompting: Improving the Dense Video Captioning Performance of Video LLMs with Temporal Anchors

TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors

Wei-Yuan Cheng

Kai-Po Chang

Chi-Pin Huang

Fu-En Yang

Yu-Chiang Frank Wang

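Dense video captioning aims to interpret and describe every temporally localized event that occurs throughout an input video. Recent state-of-the-art methods use large language models (LLMs) to provide detailed moment descriptions. However, existing video LLMs struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not accurately grounded in the content. This paper proposes TA-Prompting, a method that enhances video LLMs with Temporal Anchors. TA-Prompting localizes events precisely and prompts the video LLM to perform temporally aware video event understanding. At inference time, to determine an appropriate output caption sequence from the varying number of events present in a video, it introduces an event coherent sampling strategy that selects event captions that are temporally coherent and sufficiently similar to the given video across modalities. Extensive experiments on benchmark datasets show that TA-Prompting outperforms state-of-the-art video LLMs on dense video captioning and on temporal understanding tasks such as moment retrieval and temporal question answering.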

Original Abstract

Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs remain challenging in identifying precise event boundaries in untrimmed videos, causing the generated captions to be not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from an arbitrary number of events presented within a video, we introduce an event coherent sampling strategy to select event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that our TA-Prompting is favorable against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and temporalQA.
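The abstract does not spell out how event coherent sampling is computed, so the following is only a minimal sketch of the general idea of choosing a caption sequence from a variable number of candidate events. It assumes each candidate already carries a video-caption similarity score, and it uses temporal overlap between segments as a stand-in for coherence across events. The `Candidate` type, the `select_events` function, and both thresholds are hypothetical names and values chosen for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate event: a time span, its caption, and a similarity score."""
    start: float       # segment start, in seconds
    end: float         # segment end, in seconds
    caption: str
    similarity: float  # assumed cross-modal (video-caption) score in [0, 1]


def temporal_iou(a: Candidate, b: Candidate) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def select_events(candidates, min_similarity=0.5, max_overlap=0.5):
    """Greedy selection (an assumption, not the paper's procedure).

    Visit candidates from highest to lowest similarity, skip any whose
    score is below min_similarity, and skip any that overlaps an already
    kept event by more than max_overlap. Return the kept events in
    chronological order.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c.similarity, reverse=True):
        if cand.similarity < min_similarity:
            continue
        if all(temporal_iou(cand, k) <= max_overlap for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)


if __name__ == "__main__":
    pool = [
        Candidate(0.0, 10.0, "a person opens a laptop", 0.9),
        Candidate(1.0, 9.0, "someone uses a computer", 0.8),
        Candidate(20.0, 30.0, "the person pours coffee", 0.7),
        Candidate(40.0, 50.0, "a dog runs outside", 0.2),
    ]
    for event in select_events(pool):
        print(f"[{event.start:.0f}s-{event.end:.0f}s] {event.caption}")
```

In this toy pool the second candidate is dropped because it overlaps heavily with a higher-scoring one, and the last is dropped for low similarity, so the number of output captions adapts to the video rather than being fixed in advance.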
