Few-Step Distillation for Text-to-Image Generation: A Practical Guide
Yifan Pu, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Fan Wang, Bohan Zhuang, Gao Huang
2025-12-16
Summary
This research explores how to speed up generating images from text descriptions using a technique called 'diffusion distillation', in which a smaller, faster 'student' model is trained to mimic a large diffusion model. It focuses on making generation faster and more efficient, particularly for open-ended text prompts rather than simple category labels.
What's the problem?
While diffusion distillation works well for generating images from pre-defined classes (like 'cat' or 'dog'), it hasn't been thoroughly tested on open-ended text-to-image generation, where you can type in any description. The challenge is that moving from simple labels to detailed sentences introduces new difficulties in effectively 'teaching' a smaller, faster model to mimic a larger, more powerful one. Existing methods weren't designed for the nuances of free-form language.
What's the solution?
The researchers systematically tested and compared state-of-the-art distillation techniques using a strong text-to-image model, FLUX.1-lite, as the 'teacher'. They cast these methods into a unified framework, pinpointing the specific problems that arise when conditioning on text prompts rather than class labels. They also provide practical advice on how to scale the inputs, how to build the smaller 'student' model, and which hyperparameter settings to use. Finally, they released their code and pretrained student models so others can build on the work.
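To make the teacher-student idea concrete, here is a toy, assumption-level sketch of trajectory distillation: a "teacher" runs many small denoising-style steps toward the data, and a one-step "student" is fit to reproduce the teacher's final output directly. Everything here (the linear ODE, the affine student, the least-squares fit) is a simplification for illustration; the paper's actual methods operate on large text-conditioned diffusion transformers, not 2-D toy data.

```python
# Toy illustration of distillation (NOT the paper's method): compress a
# 50-step "teacher" sampler into a single-step "student".
import numpy as np

rng = np.random.default_rng(0)
target = np.array([2.0, -1.0])  # stand-in for the "data" the teacher models

def teacher_step(x, dt=0.1):
    # One Euler step of a toy ODE pulling x toward the data point.
    return x + dt * (target - x)

def teacher_sample(x0, n_steps=50):
    # Many-step teacher: expensive but accurate sampling trajectory.
    x = x0
    for _ in range(n_steps):
        x = teacher_step(x)
    return x

# Student: a single affine map x0 -> W @ x0 + b, regressed onto the
# teacher's 50-step outputs (here solvable in closed form via least squares;
# real distillation uses gradient descent on a neural student).
X0 = rng.normal(size=(1024, 2))                    # "noise" samples
Y = np.stack([teacher_sample(x) for x in X0])      # teacher targets
A = np.hstack([X0, np.ones((len(X0), 1))])         # affine design matrix
theta, *_ = np.linalg.lstsq(A, Y, rcond=None)

def student_sample(x0):
    # One step instead of 50.
    return np.hstack([x0, 1.0]) @ theta

x0 = rng.normal(size=2)
print(teacher_sample(x0))   # many-step teacher output
print(student_sample(x0))   # one-step student output, closely matching
```

Because this toy teacher's 50-step map happens to be affine in the starting noise, the student matches it almost exactly; with real diffusion models the teacher's map is highly nonlinear, which is why the choice of student architecture and training objective, the focus of the paper, matters so much.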
Why it matters?
This work is important because it lays the groundwork for creating fast, high-quality image generators that can turn any text description into a realistic image without requiring massive computing resources. This makes it more practical to use these technologies in real-world applications like design, content creation, and more.
Abstract
Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available at github.com/alibaba-damo-academy/T2I-Distill.