2603.29252v1 Mar 31, 2026 cs.CV

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Yiyi Zhou (Citations: 2,878; h-index: 26)

Xiaoshuai Sun (Citations: 1,038; h-index: 19)

Rongrong Ji (Citations: 855; h-index: 16)

Tao Chen (Citations: 11; h-index: 1)

Chaokun Chang (Citations: 98; h-index: 2)

Kun Zhang (Citations: 4; h-index: 2)

Qiong Wu (Citations: 155; h-index: 5)

Xiao Chen (Citations: 0; h-index: 0)

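Long video understanding is a major challenge holding back the progress of multimodal large language models (MLLMs). This paper studies the problem from the perspective of visual memory mechanisms and proposes FlexMem, a new training-free method. FlexMem mimics how humans watch video: it aims to watch video content continuously and recall the memory fragments most relevant to answering a question. Unlike existing methods, which process all video information at once and are therefore subject to an input limit, FlexMem overcomes that limit and enables understanding of videos of unlimited length. Specifically, FlexMem uses the visual KV cache as its memory source and implements effective memory transfer and writing through a dual-pathway compression design. FlexMem also explores different memory reading strategies suited to various video understanding tasks, including streaming. To validate FlexMem, the authors apply it to two popular video-MLLMs and run extensive experiments on five long-video tasks and one streaming-video task. The results show that FlexMem delivers clear improvements over existing efficient video understanding methods and can process videos of more than 1,000 frames on a single 3090 GPU. FlexMem also helps the base MLLMs achieve performance comparable to or better than state-of-the-art MLLMs such as GPT-4o and Gemini-1.5 Pro on some benchmarks.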

Original Abstract

Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic how humans watch video, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of unlimited length, unlike previous methods, which process all video information at once and are bounded by an input-length limit. Concretely, FlexMem first treats the visual KV caches as the memory sources and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming setting. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that, on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames. It also helps the base MLLMs achieve performance comparable to or even better than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.
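
The watch-then-recall idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the dual-pathway compression or the memory reading strategies work, so the sketch swaps in mean pooling as a placeholder compressor and cosine similarity as a placeholder relevance score. The names `ToyMemory`, `watch`, `recall`, and `mean_pool` are hypothetical, and plain float lists stand in for visual KV-cache entries.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Collapse a chunk of frame features into one vector.

    Placeholder only: the paper's dual-pathway compression is not
    described in the abstract, so mean pooling is an assumption.
    """
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyMemory:
    """Hypothetical memory bank: write compressed chunks, read the most relevant."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size
        self.buffer: List[Vector] = []   # frames seen since the last write
        self.entries: List[Vector] = []  # one compressed entry per chunk

    def watch(self, frame: Vector) -> None:
        """Consume one frame; write a memory entry whenever a chunk fills up."""
        self.buffer.append(frame)
        if len(self.buffer) == self.chunk_size:
            self.entries.append(mean_pool(self.buffer))
            self.buffer = []

    def recall(self, query: Vector, top_k: int) -> List[int]:
        """Return indices of the top_k most similar entries, in temporal order."""
        ranked = sorted(
            range(len(self.entries)),
            key=lambda i: cosine(query, self.entries[i]),
            reverse=True,
        )
        return sorted(ranked[:top_k])


# Stream six 2-dimensional "frames" in chunks of two, then query the memory.
memory = ToyMemory(chunk_size=2)
for frame in [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]:
    memory.watch(frame)

recalled = memory.recall([0.0, 1.0], top_k=2)
```

Because frames are consumed one at a time and only compressed entries are kept, the working set at any moment is one chunk plus the memory bank, which is the property that lets a streaming design avoid a fixed input-length limit. Only the recalled entries would be handed to the model when answering a question.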

0 Citations

0 Influential

13 Altmetric

65.0 Score
