2602.20497v1 Feb 24, 2026 cs.CV

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Peiliang Cai

Jiacheng Liu

Hao Xu

Xinyu Wang

Chang Zou

Linfeng Zhang

Abstract

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reuse or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often degrading quality and failing to stay consistent with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show that our method achieves significant acceleration while maintaining high-fidelity generation: 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
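To make the idea of stage-aware feature forecasting concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The names `LinearExpert`, `stage_of`, `run_with_cache`, and `toy_features` are hypothetical, the equal-width stage split is an assumption, and a one-parameter linear extrapolator stands in for the KAN predictor described in the abstract. The loop runs the expensive computation only every `interval` steps and lets the expert assigned to the current stage forecast the feature in between.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Feature = List[float]


@dataclass
class LinearExpert:
    """Hypothetical stand-in for one learned predictor (the paper uses a KAN)."""
    weight: float  # 1.0 means plain linear extrapolation from the cache

    def predict(self, cache: List[Tuple[int, Feature]], step: int) -> Feature:
        (s_prev, f_prev), (s_last, f_last) = cache
        # Scale the last observed change by how far ahead we are forecasting.
        scale = self.weight * (step - s_last) / (s_last - s_prev)
        return [b + scale * (b - a) for a, b in zip(f_prev, f_last)]


def stage_of(step: int, num_steps: int, num_stages: int) -> int:
    """Assign a step to a stage by equal-width split (an assumption)."""
    return min(step * num_stages // num_steps, num_stages - 1)


def run_with_cache(
    full_compute: Callable[[int], Feature],
    num_steps: int,
    interval: int,
    experts: List[LinearExpert],
) -> Tuple[List[Feature], int]:
    cache: List[Tuple[int, Feature]] = []  # two most recent fully computed features
    outputs: List[Feature] = []
    full_calls = 0
    for step in range(num_steps):
        if step % interval == 0 or len(cache) < 2:
            # Full (expensive) computation; refresh the cache.
            feature = full_compute(step)
            full_calls += 1
            cache = (cache + [(step, feature)])[-2:]
        else:
            # Cheap forecast by the expert responsible for this stage.
            expert = experts[stage_of(step, num_steps, len(experts))]
            feature = expert.predict(cache, step)
        outputs.append(feature)
    return outputs, full_calls


def toy_features(step: int) -> Feature:
    """Hypothetical expensive block whose output drifts linearly with the step."""
    return [2.0 * step, -1.0 * step]


experts = [LinearExpert(1.0), LinearExpert(1.0), LinearExpert(1.0)]
outputs, full_calls = run_with_cache(toy_features, num_steps=20, interval=5, experts=experts)
print(full_calls, outputs[7])
```

Because the toy feature is exactly linear, every expert with `weight=1.0` reproduces it without error. In a trained system each stage would carry its own learned parameters. The ratio of total steps to full computations in this sketch is only a step count for the toy loop and should not be read as the speedups reported in the abstract.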
