2603.01455v1 Mar 02, 2026 cs.CV

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Niu Lian (Citations: 42, h-index: 4)
Hanshu Yao (Citations: 4, h-index: 1)
Shu-Tao Xia (Citations: 350, h-index: 8)
Jinpeng Wang (Citations: 595, h-index: 12)
Yuting Wang (Citations: 79, h-index: 5)
Bin Chen (Citations: 97, h-index: 6)
Yaowei Wang (Citations: 141, h-index: 6)
Min Zhang (Citations: 338, h-index: 9)

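Multimodal large language models show strong short-term reasoning but struggle with long-horizon video understanding, owing to limited context windows and static memory mechanisms that do not reflect the efficiency of human cognition. Existing methods fall mainly into two extremes: vision-centric methods, which incur high latency and redundancy, and text-centric approaches, which lose detail and hallucinate through aggressive captioning. To close this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem organizes memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic Schema, so that fine-grained perceptual information is progressively distilled into high-level semantic schemas. To govern memory construction, we derive a Semantic Information Bottleneck (SIB) objective and introduce SIB-GRPO to optimize the balance between memory compression and retention of task-relevant information. At inference time, an entropy-driven top-down retrieval strategy first consults the abstract Symbolic Schema and, when uncertainty is high, progressively drills down into the Sensory Buffer and Episodic Stream for detail. Extensive experiments on four benchmarks show that MM-Mem is effective on both offline and streaming tasks, generalizes robustly, and supports the value of cognition-inspired memory organization. Code: https://github.com/EliSpectre/MM-Mem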

Original Abstract

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
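
The entropy-driven top-down retrieval described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `PyramidMemory` container, the `retrieve_top_down` function, the `toy_answer_dist` stand-in for a model, the threshold value, and the drill-down order (Symbolic Schema, then Episodic Stream, then Sensory Buffer) are all assumptions made for illustration. Only the three level names and the idea of descending under high uncertainty come from the abstract.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Level names follow the abstract; visiting them from most abstract to most
# detailed is an assumption of this sketch.
LEVELS = ("symbolic_schema", "episodic_stream", "sensory_buffer")


@dataclass
class PyramidMemory:
    """Hypothetical container holding one list of entries per memory level."""
    symbolic_schema: List[str]
    episodic_stream: List[str]
    sensory_buffer: List[str]

    def level(self, name: str) -> List[str]:
        return getattr(self, name)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def retrieve_top_down(
    memory: PyramidMemory,
    answer_dist: Callable[[List[str]], Sequence[float]],
    threshold: float,
) -> Dict[str, object]:
    """Add memory levels to the context until answer entropy is low enough."""
    context: List[str] = []
    used: List[str] = []
    probs: List[float] = []
    for name in LEVELS:
        context.extend(memory.level(name))
        used.append(name)
        probs = list(answer_dist(context))
        if entropy(probs) <= threshold:
            break  # confident enough, so stop before reading finer detail
    return {"levels_used": used, "probs": probs, "entropy": entropy(probs)}


def toy_answer_dist(context: List[str]) -> List[float]:
    """Stand-in for a model: more context gives a sharper two-way answer."""
    p = min(0.5 + 0.15 * len(context), 0.99)
    return [p, 1.0 - p]


memory = PyramidMemory(
    symbolic_schema=["a person cooks dinner"],
    episodic_stream=["chops onions", "boils pasta"],
    sensory_buffer=["frame_0412", "frame_0413"],
)
result = retrieve_top_down(memory, toy_answer_dist, threshold=0.5)
print(result["levels_used"])
```

In this toy setup the threshold decides how deep retrieval goes: a loose threshold stops at the Symbolic Schema, while a strict one forces the search down to the Sensory Buffer. A real system would replace `toy_answer_dist` with the model's actual answer distribution.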

4 Citations · 0 Influential · 34.95879734614 Altmetric · 178.8 Score
