Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
2025-10-10
Summary
This paper focuses on making diffusion models, which generate images and videos from text, much faster and more efficient. The authors achieve this by improving a technique called 'consistency distillation' so that it works with very large and complex models.
What's the problem?
While 'consistency distillation' works well in theory and on smaller models, it has been difficult to apply to the really big image and video generation models used today. One reason is that a key quantity in the training process, the Jacobian-vector product (JVP), is expensive to compute and not supported by the highly optimized attention kernels these large models depend on. Another is that the original method struggled to produce fine details, tending to blur them out as small errors accumulated during generation.
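To make the JVP concrete: for a function f, the JVP at a point x along a direction v is the directional derivative J(x)·v, which forward-mode autodiff computes in a single pass without ever building the full Jacobian. This toy numpy sketch (purely illustrative, not the paper's kernel) approximates a JVP with finite differences and checks it against the analytic Jacobian:

```python
import numpy as np

def f(x):
    # A small nonlinear map standing in for a network layer (illustrative only).
    return np.array([x[0] ** 2 + x[1], np.sin(x[1])])

def jvp_finite_diff(f, x, v, eps=1e-6):
    # Directional derivative J(x) @ v approximated by central differences.
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

x = np.array([1.0, 0.5])
v = np.array([0.3, -0.2])

# Analytic Jacobian of f at x, for comparison.
J = np.array([[2 * x[0], 1.0],
              [0.0, np.cos(x[1])]])

print(jvp_finite_diff(f, x, v))  # approximates J @ v
print(J @ v)
```

Inside a 10B+ parameter transformer, the challenge the paper addresses is doing this forward-mode pass efficiently through attention itself, which is why the authors build a dedicated FlashAttention-2 JVP kernel.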
What's the solution?
The researchers first built a parallelism-compatible FlashAttention-2 kernel for computing the Jacobian-vector product, making it practical to train these large models. They then introduced a new method called 'score-regularized continuous-time consistency' (rCM), which adds score distillation as a 'quality check' during training, helping the model produce sharp details while preserving the diversity of the generated images. It's like adding a second opinion to make sure the image looks good.
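The intuition behind pairing the two objectives is that the 'mode-covering' forward divergence and the 'mode-seeking' reverse divergence fail in opposite ways. This toy numeric sketch (not the paper's actual objective) fits a single Gaussian q to a two-mode target p by grid search under each divergence:

```python
import numpy as np

xs = np.linspace(-8, 8, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target: a mixture of two well-separated modes.
p = 0.5 * gauss(xs, -3.0, 0.5) + 0.5 * gauss(xs, 3.0, 0.5)

def kl(a, b):
    # KL(a || b) on the grid; a small floor avoids log(0).
    a = np.maximum(a, 1e-12); b = np.maximum(b, 1e-12)
    return np.sum(a * np.log(a / b)) * dx

best_fwd, best_rev = None, None
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.3, 4, 75):
        q = gauss(xs, mu, sigma)
        fwd = kl(p, q)  # forward KL: penalizes q for missing any mass of p
        rev = kl(q, p)  # reverse KL: penalizes q for mass where p is low
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

# Forward-KL fit straddles both modes (mean near 0, broad sigma: blurry but
# diverse); reverse-KL fit collapses onto one sharp mode.
print("forward KL fit: mu=%.2f sigma=%.2f" % best_fwd[1:])
print("reverse KL fit: mu=%.2f sigma=%.2f" % best_rev[1:])
```

The forward-KL fit spreads itself over both modes (analogous to sCM's blurriness), while the reverse-KL fit locks onto a single sharp mode (analogous to score distillation's sharp but less diverse samples); rCM combines the two so that, roughly speaking, each compensates for the other's failure mode.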
Why it matters?
This work is important because it speeds up diffusion sampling by 15 to 50 times, allowing high-quality images and videos to be generated quickly and efficiently even with very large models. This makes advanced image and video generation more practical and accessible, and provides a solid foundation for future improvements in this field.
Abstract
This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1–4 steps, accelerating diffusion sampling by 15×–50×. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.