2602.05449v2 Feb 05, 2026 cs.CV

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kai Huang, Linfeng Zhang

Abstract

While diffusion models have achieved great success in video generation, this progress comes with a rapidly escalating computational burden. Among existing acceleration methods, feature caching is popular because it is training-free and delivers considerable speedup, but it inevitably loses semantics and detail under further compression. Another widely adopted method, training-aware step distillation, is successful in image generation but also degrades drastically in few-step video generation. Furthermore, the quality loss becomes more severe when training-free feature caching is simply applied to step-distilled models, because the sampling steps are sparser. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics, enabling a more accurate capture of the high-dimensional feature evolution process in diffusion models. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. Together, these contributions push the acceleration boundary to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code will be made publicly available soon.
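
To make the contrast between training-free caching and a learned predictor concrete, the following is a minimal toy sketch and not the paper's method. It assumes a hypothetical `expensive_block` standing in for a transformer block, an assumed schedule that skips every odd sampling step, and a single scalar weight `w` fitted by least squares as a stand-in for the lightweight neural predictor the abstract describes. The weight is fitted and evaluated on the same toy trajectory, so the comparison only illustrates the idea that a fitted predictor can track feature evolution better than plain cache reuse.

```python
import math

def expensive_block(t, dim=4):
    # Hypothetical stand-in for a costly transformer block output at timestep t.
    return [math.sin(t + 0.1 * i) for i in range(dim)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Ground-truth feature trajectory over 21 sampling steps.
features = [expensive_block(0.1 * k) for k in range(21)]

# Assumed schedule: even steps are computed in full, odd steps are skipped.
# At a skipped step k, the two most recent full features are f[k-1] and f[k-3].
triples = [(features[k - 3], features[k - 1], features[k]) for k in range(3, 21, 2)]

def fit_weight(triples):
    # Closed-form least squares for w in: target ~ last + w * (last - prev).
    num = sum(dot(sub(last, prev), sub(target, last)) for prev, last, target in triples)
    den = sum(dot(sub(last, prev), sub(last, prev)) for prev, last, target in triples)
    return num / den

def total_error(triples, w):
    # Sum of squared errors of the cached estimate; w = 0 is plain cache reuse.
    err = 0.0
    for prev, last, target in triples:
        delta = sub(last, prev)
        estimate = [l + w * d for l, d in zip(last, delta)]
        err += dot(sub(target, estimate), sub(target, estimate))
    return err

w = fit_weight(triples)
err_reuse = total_error(triples, 0.0)    # training-free heuristic: reuse last feature
err_learned = total_error(triples, w)    # toy "learned" predictor
print(f"w={w:.3f} reuse_error={err_reuse:.5f} learned_error={err_learned:.5f}")
```

Because `w = 0` reproduces plain reuse, the fitted weight can never do worse than reuse on the data it was fitted to; a real predictor would need held-out evaluation and a far richer model than one scalar.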
