2602.01801v1 Feb 02, 2026 cs.CV

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Gal Chechik

Citations: 1,859

h-index: 19

Issar Tzachor

Citations: 52

h-index: 3

D. Samuel

Citations: 372

h-index: 8

Rami Ben-Ari

Citations: 73

h-index: 4

Matan Levy

PhD Student at The Hebrew University of Jerusalem

Citations: 150

h-index: 6

Micah Green

Citations: 0

h-index: 0

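Autoregressive video diffusion models enable streaming generation, paving the way for long-form video synthesis, video world models, and interactive neural game engines. Their core attention layers, however, are the main bottleneck at inference time. As generation proceeds, the KV cache grows, which raises latency and GPU memory usage, limits the usable temporal context, and harms long-range consistency. This work analyzes redundancy in autoregressive video diffusion models and identifies three main sources: near-duplicate cached keys across frames; slowly changing (largely semantic) queries and keys, which make many attention computations redundant; and cross-attention over long prompts, where only a small subset of tokens matters for each frame. Based on these observations, the authors propose a unified, training-free attention framework for autoregressive diffusion models. TempCache compresses the KV cache using temporal correspondence to bound cache growth. AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens with fast approximate nearest neighbor (ANN) matching. AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention computation and memory usage, and they are compatible with existing autoregressive diffusion backbones and world models. Experiments show end-to-end speedups of up to 5x to 10x with near-identical visual quality, along with stable throughput and nearly constant peak GPU memory over long rollouts, whereas prior methods progressively slow down and use more memory.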
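To make the idea of bounding cache growth concrete, here is a minimal sketch of merging near-duplicate keys when a new frame is appended to a KV cache. It is not the paper's TempCache algorithm, whose details are not given in the abstract. The function name `compress_append`, the cosine-similarity test, the running-average merge, and the `threshold` value are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def compress_append(cache_keys, cache_values, counts,
                    new_keys, new_values, threshold=0.98):
    """Append one frame's keys/values to the cache in place.

    Hypothetical merge rule (an assumption, not the paper's method):
    a new key whose best cosine match in the cache reaches `threshold`
    is folded into that entry as a running average; otherwise it is
    appended as a new entry. `counts` tracks how many keys each entry
    has absorbed so the average stays unbiased.
    """
    for k, v in zip(new_keys, new_values):
        best_i, best_sim = -1, -1.0
        for i, ck in enumerate(cache_keys):
            s = cosine(k, ck)
            if s > best_sim:
                best_i, best_sim = i, s
        if best_i >= 0 and best_sim >= threshold:
            n = counts[best_i]
            cache_keys[best_i] = [(c * n + x) / (n + 1)
                                  for c, x in zip(cache_keys[best_i], k)]
            cache_values[best_i] = [(c * n + x) / (n + 1)
                                    for c, x in zip(cache_values[best_i], v)]
            counts[best_i] = n + 1
        else:
            cache_keys.append(list(k))
            cache_values.append(list(v))
            counts.append(1)
    return cache_keys, cache_values, counts

# Usage: two frames of two tokens each; one token repeats almost exactly.
keys, values, counts = [], [], []
compress_append(keys, values, counts, [[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]])
compress_append(keys, values, counts, [[1.0, 0.001], [-1.0, 0.0]], [[3.0], [4.0]])
print(len(keys), counts)
```

In this toy run the cache holds three entries instead of four, because the repeated token is merged rather than stored again. The linear scan over the cache is only for clarity; a practical version would need a faster matching step.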

Original Abstract

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5x to 10x end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
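The idea shared by AnnCA and AnnSA, restricting each query to a small set of matched keys or prompt tokens, can be sketched as top-k sparse attention. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses an exact dot-product top-k search as a stand-in for the ANN matching the abstract mentions, and the names `topk_indices` and `sparse_attention` and the parameter `k` are hypothetical.

```python
import math

def topk_indices(query, keys, k):
    """Return the indices of the k keys with the largest dot product
    with `query`, plus all scores. Exact search is used here as a
    stand-in for an approximate nearest neighbor (ANN) index."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores

def sparse_attention(query, keys, values, k):
    """Softmax attention for one query, restricted to its top-k keys.

    The same routine can illustrate self-attention (keys from cached
    frames) or cross-attention (keys from prompt tokens)."""
    k = max(1, min(k, len(keys)))          # clamp k to a valid range
    idx, scores = topk_indices(query, keys, k)
    scale = 1.0 / math.sqrt(len(query))    # standard 1/sqrt(d) scaling
    logits = [scores[i] * scale for i in idx]
    m = max(logits)                        # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, idx):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, idx

# Usage: three keys, keep only the two best matches for the query.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
out, idx = sparse_attention([1.0, 0.0], keys, values, k=2)
print(idx, out)
```

With `k` fixed, the softmax and weighted sum for each query touch only `k` keys regardless of cache or prompt length. The exact search here still scores every key, which is the step an ANN index would replace.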

0 Citations

0 Influential

9.5 Altmetric

47.5 Score
