
Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Tianhe Wu, Ruibin Li, Lei Zhang, Kede Ma

2026-02-04

Summary

This paper introduces a new way to make AI models that create images from text faster and more efficient, focusing on a technique called distillation. In this setting, distillation teaches a fast model that generates an image in just a few steps to mimic a slower model that needs many steps.

What's the problem?

A common method for this distillation, called Distribution Matching Distillation (DMD), often runs into a problem where the generated images all start to look very similar, a failure known as 'mode collapse'. This happens because its training objective, a reverse KL divergence, rewards the student for concentrating on a few 'safe' outputs rather than covering everything the teacher can produce. Existing fixes usually bolt on extra components such as perceptual or adversarial losses, which makes training harder and requires more computing power.
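As a rough illustration of why a reverse-KL objective is mode-seeking (the notation below is generic, not taken from this summary), write q for the few-step generator's distribution and p for the teacher's distribution:

```latex
% Reverse KL between the generator distribution q and the teacher
% distribution p, stated generically. The expectation is taken over
% samples from q, so regions where q places (almost) no mass contribute
% nothing: q can drop entire modes of p without being penalized,
% which is the mode-seeking / mode-collapsing behavior.
\[
  D_{\mathrm{KL}}(q \,\|\, p)
    = \mathbb{E}_{x \sim q}\!\left[ \log \frac{q(x)}{p(x)} \right]
\]
```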

What's the solution?

The researchers propose a new approach called Diversity-Preserved DMD (DP-DMD). Instead of adding new components, they give each generation step its own role: the first step is trained only to preserve the diversity of the outputs, using a simple target-prediction (v-prediction) objective, while the later steps refine image quality under the standard DMD loss. Importantly, gradients from the quality-focused DMD loss are blocked from flowing back into the diversity step, keeping the two roles cleanly separated. The method needs no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images; a rough sketch of the setup is shown below.
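A minimal, hypothetical PyTorch-style sketch of this role separation follows. The names (generator, v_prediction_loss, dmd_loss) are placeholders rather than the paper's actual code; the point is only that the first step gets its own objective and is detached before the DMD loss is applied.

```python
def dp_dmd_training_step(generator, v_prediction_loss, dmd_loss, noise, prompt):
    """Sketch of a role-separated training step (placeholder names)."""
    # First step: trained with a target-prediction (v-prediction)
    # objective so that sample diversity is preserved.
    x1 = generator.step(noise, prompt, step=1)
    loss_diversity = v_prediction_loss(x1, noise, prompt)

    # Block gradients from the quality objective at the first step:
    # the DMD loss treats x1 as a constant input.
    x1_detached = x1.detach()

    # Subsequent step(s): refine quality under the standard DMD loss.
    x2 = generator.step(x1_detached, prompt, step=2)
    loss_quality = dmd_loss(x2, prompt)

    return loss_diversity + loss_quality
```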

Why it matters?

This work is important because it offers a simpler and more efficient way to keep fast text-to-image generation both high-quality and diverse. By avoiding extra components such as discriminators or perceptual networks, it keeps training more stable and less computationally expensive, potentially making advanced image generation more accessible.

Abstract

Distribution matching distillation (DMD) aligns a multi-step generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which existing remedies typically rely on perceptual or adversarial regularization, thereby incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD), which, despite its simplicity -- no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images -- preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.
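For reference, the v-prediction objective mentioned in the abstract typically regresses the standard 'velocity' target from progressive distillation (Salimans and Ho, 2022); a common form, with notation assumed here rather than taken from the paper, is:

```latex
% Standard v-prediction target, stated generically.
% x_0 is the clean image, eps the injected noise, and alpha_t, sigma_t
% the noise-schedule coefficients with x_t = alpha_t * x_0 + sigma_t * eps.
\[
  v_t = \alpha_t\,\epsilon - \sigma_t\,x_0,
  \qquad
  \mathcal{L}_{\mathrm{v\text{-}pred}}
    = \mathbb{E}\!\left[\,\bigl\|\hat{v}_\theta(x_t, t) - v_t\bigr\|_2^2\,\right]
\]
```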