2602.13191v1 Feb 13, 2026 cs.CV

CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar

Stanford University

Rémi Pautrat

O. Mikšík

Iro Armeni

Mahdi Rad

Mihai Dusmanu

Marc Pollefeys

Abstract

Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing full images and their tokens for every frame incurs substantial computational overhead. To address these limitations, we propose leveraging video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
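To make the term "codec primitives" concrete, the following is a minimal sketch of how a predicted frame can be described by a keyframe plus per-block motion vectors and a residual. It is a toy illustration of block-based motion compensation in general, not the encoder proposed in the paper; the function names `motion_compensate` and `reconstruct`, the 4-pixel block size, and the integer-pixel vectors are all assumptions made for this example.

```python
import numpy as np

def motion_compensate(ref, mvs, block=4):
    """Predict a frame by copying blocks from `ref`, displaced by per-block
    motion vectors (dy, dx). Assumes frame sides are multiples of `block`."""
    h, w = ref.shape
    pred = np.empty_like(ref)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = mvs[by // block, bx // block]
            # Clamp the source block so it stays inside the reference frame.
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            pred[by:by + block, bx:bx + block] = ref[sy:sy + block, sx:sx + block]
    return pred

def reconstruct(ref, mvs, residual, block=4):
    """Rebuild the frame: motion-compensated prediction plus residual."""
    return motion_compensate(ref, mvs, block) + residual

# Toy data: an 8x8 keyframe whose content shifts one pixel to the right.
rng = np.random.default_rng(0)
key = rng.integers(0, 256, size=(8, 8)).astype(np.int16)  # int16 allows negative residuals
target = np.roll(key, shift=1, axis=1)

# One (dy, dx) vector per 4x4 block: fetch each block from one pixel to the left.
mvs = np.zeros((2, 2, 2), dtype=np.int64)
mvs[..., 1] = -1

# The residual is whatever the motion-compensated prediction fails to explain.
residual = target - motion_compensate(key, mvs)
recon = reconstruct(key, mvs, residual)
```

In this toy case the right half of the residual is entirely zero, because the motion vectors alone explain it. That sparsity is the kind of redundancy the abstract refers to: most of the frame is described by a handful of vectors rather than by full pixel data.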
