2605.07897v1 May 08, 2026 cs.CV

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

Yiwei Wang

Citations: 229

h-index: 9

Yujun Cai

Citations: 3,844

h-index: 21

Hang Wu

Citations: 115

h-index: 5

Sherin M. Mathews

Citations: 522

h-index: 7

Ming-Hsuan Yang

Citations: 71

h-index: 4

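Online streaming video understanding requires a model to process continuous visual input and answer user queries in real time. Because the stream is unbounded and query timing is unpredictable, memory management becomes a central challenge. Existing methods typically compress visual tokens with visual-similarity heuristics, or combine compression with KV-cache-level retrieval. However, compression decisions rarely take semantic information into account, and retrieval is often added only after compression is finished, which makes the two stages hard to coordinate. This paper proposes SAVEMem, a training-free dual-stage framework that integrates semantic information into memory generation and adapts the retrieval scope to each query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a fixed memory budget. A fixed pseudo-question bank supplies a lightweight semantic prior, so long-term retention is determined by semantic salience as well as visual similarity. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory depending on whether the query targets the present or the distant past. Within that scope, late interaction between query and memory tokens selects the candidate frames used for answering. Applied to Qwen2.5-VL without training, SAVEMem raises the OVO-Bench overall score from 52.27 to 62.69 and shows consistent gains on StreamingBench and ODV-Bench. It also reduces peak GPU memory by 48% at 128 frames relative to the backbone.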
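To make the idea of a three-tier memory under a fixed budget concrete, here is a minimal sketch. It is not SAVEMem's implementation: the abstract does not specify tier sizes, promotion rules, or how the pseudo-question bank scores frames. The `Frame` and `TieredMemory` names, the per-tier capacities, the oldest-first promotion, and the caller-supplied `salience` value are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Frame:
    index: int       # position of the frame in the stream
    salience: float  # hypothetical semantic score supplied by the caller


@dataclass
class TieredMemory:
    """Illustrative three-tier buffer whose total size never exceeds
    short_cap + mid_cap + long_cap (assumed capacities, not from the paper)."""
    short_cap: int
    mid_cap: int
    long_cap: int
    short: deque = field(default_factory=deque)
    mid: deque = field(default_factory=deque)
    long: list = field(default_factory=list)

    def add(self, frame: Frame) -> None:
        # New frames always enter the short-term tier.
        self.short.append(frame)
        # Overflow cascades: the oldest short-term frame moves to mid-term,
        # and the oldest mid-term frame moves to long-term.
        if len(self.short) > self.short_cap:
            self.mid.append(self.short.popleft())
        if len(self.mid) > self.mid_cap:
            self.long.append(self.mid.popleft())
        # Long-term retention is decided by salience, not by age
        # (an assumed stand-in for "semantic salience" in the abstract).
        if len(self.long) > self.long_cap:
            victim = min(self.long, key=lambda f: f.salience)
            self.long.remove(victim)

    def size(self) -> int:
        return len(self.short) + len(self.mid) + len(self.long)
```

In this sketch the memory footprint stays constant however long the stream runs, and only the long-term tier consults the semantic score. That mirrors the stated goal that long-term retention be shaped by salience rather than visual similarity alone.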

Original Abstract

Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
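The abstract says late interaction between query and memory tokens selects candidate frames. The sketch below shows the general late-interaction (MaxSim) scoring pattern: each query token is matched to its most similar frame token, and the per-token maxima are summed. Whether SAVEMem uses exactly this scoring, cosine similarity, or a top-k cut-off is not stated in the abstract. The function names, the `memory` dictionary layout, and the `top_k` parameter are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def late_interaction_score(query_tokens: Sequence[Vector],
                           frame_tokens: Sequence[Vector]) -> float:
    """MaxSim: for each query token take its best-matching frame token,
    then sum those maxima over all query tokens."""
    if not frame_tokens:
        return 0.0
    return sum(max(cosine(q, t) for t in frame_tokens) for q in query_tokens)


def select_frames(query_tokens: Sequence[Vector],
                  memory: Dict[int, Sequence[Vector]],
                  top_k: int) -> List[int]:
    """Return the indices of the top_k highest-scoring frames in temporal order.
    `memory` maps a frame index to its token embeddings (assumed layout)."""
    ranked = sorted(
        memory.items(),
        key=lambda item: late_interaction_score(query_tokens, item[1]),
        reverse=True,
    )
    return sorted(index for index, _ in ranked[:top_k])
```

Tokens are compared individually rather than pooled into one vector per frame, so a frame can score highly when different parts of it match different parts of the query.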
