2601.09881v1 Jan 14, 2026 cs.CV

Transition Matching Distillation for Fast Video Generation

Weili Nie (Citations: 427; h-index: 11)

Julius Berner (Citations: 55; h-index: 3)

Nanye Ma (Citations: 747; h-index: 5)

Chao Liu (Citations: 207; h-index: 5)

Saining Xie (Citations: 1,994; h-index: 8)

Arash Vahdat (Citations: 6,188; h-index: 26)

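Large-scale video diffusion and flow models have achieved remarkable success in high-quality video generation, but their inefficient multi-step sampling limits their use in real-time interactive applications. This work proposes Transition Matching Distillation (TMD), a new framework that distills video diffusion models into efficient few-step generators. The core idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, in which each transition is modeled as a lightweight conditional flow. For efficient distillation, the original diffusion backbone is decomposed into two components: (1) a main backbone, comprising most of the early layers, which extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, which uses these representations to perform multiple inner flow updates. Starting from a pretrained video diffusion model, a flow head is first introduced into the model and adapted into a conditional flow map. Distribution matching distillation is then applied to the student model, with the flow head rolled out in each transition step. Extensive experiments distilling the Wan2.1 1.3B and 14B text-to-video models show that TMD offers a flexible and strong trade-off between generation speed and visual quality. In particular, at comparable inference cost, TMD outperforms existing distilled models in visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd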

Original Abstract

Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd
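The two-level structure described in the abstract (a few outer transition steps, each followed by several inner flow updates from a small head) can be illustrated with a toy sampling loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `tmd_style_sample`, `CallCounter`, `backbone`, and `flow_head` are hypothetical, the inner update is a plain Euler step chosen for illustration, and how the paper actually initializes and integrates the inner flow is not specified in the abstract.

```python
from typing import Callable, List, Sequence

Vector = List[float]


class CallCounter:
    """Wraps a function and counts how often it is called (illustration only)."""

    def __init__(self, fn: Callable[..., Vector]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, *args) -> Vector:
        self.calls += 1
        return self.fn(*args)


def tmd_style_sample(
    x: Vector,
    backbone: Callable[[Vector, float], Vector],
    flow_head: Callable[[Vector, Vector, float], Vector],
    outer_times: Sequence[float],
    num_inner_steps: int,
) -> Vector:
    """Hypothetical two-level loop: few outer transitions, several inner updates.

    Assumption: each inner update is an Euler step on a velocity predicted by
    the flow head, conditioned on features computed once per outer step.
    """
    for t in outer_times:
        # Expensive part: the main backbone runs once per outer transition.
        features = backbone(x, t)
        y = list(x)
        for k in range(num_inner_steps):
            s = k / num_inner_steps  # inner flow time in [0, 1)
            # Cheap part: the flow head reuses the cached features.
            velocity = flow_head(features, y, s)
            y = [yi + vi / num_inner_steps for yi, vi in zip(y, velocity)]
        x = y
    return x


# Toy stand-ins for the networks (assumptions, not real models).
toy_backbone = CallCounter(lambda x, t: [0.5 * xi for xi in x])
toy_head = CallCounter(lambda feats, y, s: [-f for f in feats])

sample = tmd_style_sample([2.0, -4.0], toy_backbone, toy_head, [1.0, 0.5], 4)
print(sample, toy_backbone.calls, toy_head.calls)
# prints: [0.5, -1.0] 2 8
```

The call counts show the intended cost profile: the backbone stand-in is evaluated once per outer step (2 calls), while the head stand-in absorbs the extra inner updates (8 calls). Under the abstract's description, this is what would let the number of inner updates be increased without re-running most of the layers.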

2 Citations

0 Influential

13 Altmetric

67.0 Score
