One-step Diffusion Models with f-Divergence Distribution Matching
Yilun Xu, Weili Nie, Arash Vahdat
2025-02-24
Summary
This paper introduces f-distill, a method that speeds up AI image generation by training a fast one-step model to match the output distribution of a slower multi-step diffusion model, using a flexible family of distance measures called f-divergences.
What's the problem?
Current AI image generators, called diffusion models, are slow because they need many denoising steps to create an image. Faster one-step methods exist, but they often lose fine details or variety: the standard training objective tends to focus on a few common image types while dropping rarer ones.
What's the solution?
The researchers created f-distill, which uses a family of mathematical measures called f-divergences to compare the images created by a fast, single-step model (the student) against those of a slower, more accurate model (the teacher). Earlier methods relied on one specific measure, the reverse KL divergence, which tends to "mode-seek" and ignore parts of the teacher's distribution. By testing different f-divergences, the researchers found that less mode-seeking ones, such as the Jensen-Shannon divergence, help the student produce high-quality, diverse images in a single step.
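The trade-off between divergences can be seen in the weighting function h(r) = r²·f''(r) that appears in the paper's gradient, where r is the teacher/student density ratio. The small sketch below computes h(r) for a few common f-divergence generators; the exact expressions follow the standard definitions of these generators and are an illustration, not the authors' code.

```python
import numpy as np

# Density ratio r = p_teacher(x) / p_student(x) at some sampled points.
r = np.array([0.1, 0.5, 1.0, 2.0, 10.0])

# Weighting h(r) = r^2 * f''(r) for standard f-divergence generators:
#   reverse-KL:  f(r) = -log r      -> f''(r) = 1/r^2      -> h(r) = 1
#   forward-KL:  f(r) = r log r     -> f''(r) = 1/r        -> h(r) = r
#   Jensen-Shannon:                    f''(r) = 1/(r(1+r)) -> h(r) = r/(1+r)
h_reverse_kl = np.ones_like(r)
h_forward_kl = r
h_js = r / (1.0 + r)

# Less mode-seeking divergences (forward-KL, JS) give more weight to samples
# where the teacher density dominates (large r), encouraging mode coverage;
# reverse-KL weights all samples equally.
for name, h in [("reverse-KL", h_reverse_kl),
                ("forward-KL", h_forward_kl),
                ("Jensen-Shannon", h_js)]:
    print(f"{name:15s} h(r) = {np.round(h, 3)}")
```

Note how the Jensen-Shannon weighting grows with r but stays bounded below 1, which the paper links to lower training variance than the unbounded forward-KL weighting.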
Why it matters?
This matters because it makes AI image generation much faster without sacrificing quality, which is crucial for real-time applications like video games or interactive design tools. The new method achieves state-of-the-art one-step generation quality on standard benchmarks (ImageNet-64 and zero-shot text-to-image on MS-COCO), making AI-generated images more practical for everyday applications.
Abstract
Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching which is known to be mode seeking. In this paper, we generalize the distribution matching approach using a novel f-divergence minimization framework, termed f-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the f-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution, when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative f-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, f-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill
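The gradient described in the abstract can be sketched as follows, with p the teacher distribution, q_θ the student distribution, and r(x) their density ratio; signs and constants are omitted, and the exact form should be taken from the paper itself:

```latex
\nabla_\theta D_f(p \,\|\, q_\theta)
\;\propto\;
\mathbb{E}_{x \sim q_\theta}\!\left[
  h\big(r(x)\big)\,
  \big(\nabla_x \log q_\theta(x) - \nabla_x \log p(x)\big)\,
  \frac{\partial x}{\partial \theta}
\right],
\qquad
h(r) = f''(r)\, r^2,
\quad
r(x) = \frac{p(x)}{q_\theta(x)}.
```

The score difference ∇_x log q_θ − ∇_x log p is estimated with the teacher's and student's score networks, while the weighting h(r) depends only on the density ratio; choosing f to be the reverse-KL generator makes h constant, recovering variational score distillation as a special case.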