2605.14513v1 May 14, 2026 cs.CV

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Yuexiao Ma

Xuzhe Zheng

Xiawu Zheng

Fei Chao

Rongrong Ji

Jing Xu

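Diffusion-based video generation has made large gains in visual quality and temporal consistency, but the quadratic complexity of full attention limits its practical use. Training-free sparse attention is appealing because it accelerates a pretrained model without retraining. However, existing online top-p sparse attention still spends considerable cost on mask prediction, and it applies a single shared threshold without accounting for head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in video DiTs. To address them, we propose a head-wise adaptive framework built from two modular components. The first is Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift. The second is Error-guided Budgeted Calibration, which assigns each head a top-p threshold by minimizing model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves on XAttention and SVG2, reaching up to a 1.93× speedup at 720P while maintaining competitive video quality and similarity metrics.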

Original Abstract

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
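
To make the two ideas in the abstract concrete, here is a minimal pure-Python sketch of top-p key selection with a per-head threshold, plus mask reuse gated by a drift test. It is an illustration under stated assumptions, not the authors' implementation. The names `top_p_mask`, `relative_drift`, and `HeadMaskCache`, the pooled query-key summary vector, the relative L2 drift measure, and the tolerance values are all hypothetical. The abstract does not specify how drift is measured or how thresholds are calibrated, so the per-head `p` values are simply supplied by hand here.

```python
import math


def top_p_mask(scores, p):
    """Keep the smallest set of keys whose softmax mass reaches p."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    mass = 0.0
    for i in order:
        keep[i] = True
        mass += probs[i]
        if mass >= p:
            break
    return keep


def relative_drift(cur, ref):
    """Relative L2 change between two vectors (assumed drift measure)."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(cur, ref)))
    den = math.sqrt(sum(b * b for b in ref)) or 1.0
    return num / den


class HeadMaskCache:
    """One attention head: its own top-p threshold and a reusable mask."""

    def __init__(self, p, drift_tol):
        self.p = p                  # per-head top-p threshold (hand-set here)
        self.drift_tol = drift_tol  # hypothetical reuse tolerance
        self.ref_summary = None     # summary seen when the mask was built
        self.mask = None
        self.recomputed = 0         # how many times the mask was predicted

    def get_mask(self, qk_summary, score_fn):
        """Reuse the cached mask unless the query-key summary has drifted.

        score_fn is only called when the mask must be predicted again,
        which is where the saving comes from in this sketch.
        """
        stale = (
            self.mask is None
            or relative_drift(qk_summary, self.ref_summary) > self.drift_tol
        )
        if stale:
            self.mask = top_p_mask(score_fn(), self.p)
            self.ref_summary = list(qk_summary)
            self.recomputed += 1
        return self.mask


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.0, -1.0]
    # A low-threshold head keeps fewer keys than a high-threshold head.
    print(top_p_mask(scores, 0.6))
    print(top_p_mask(scores, 0.95))

    head = HeadMaskCache(p=0.6, drift_tol=0.1)
    head.get_mask([1.0, 0.0], lambda: scores)   # first step: predict mask
    head.get_mask([1.02, 0.0], lambda: scores)  # small drift: reuse mask
    print(head.recomputed)
```

In this toy setup, a lower `p` keeps fewer keys for the same scores, which is the kind of difference a per-head threshold is meant to exploit. A calibration step such as the one the abstract describes would choose those `p` values from measured output error under a global sparsity budget instead of fixing them by hand.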
