Diffusion Self-Distillation for Zero-Shot Customized Image Generation
Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein
2024-11-28

Summary
This paper presents a method called Diffusion Self-Distillation, which enables customized, identity-preserving image generation from a reference image and a text description without requiring large amounts of paired training data or per-instance fine-tuning.
What's the problem?
Creating images that preserve a specific subject while following a new description is hard because there isn't enough high-quality paired data (a reference image and a target image of the same subject, with matching text) to train an image+text-conditional model directly. Artists often want to generate images that keep a subject's identity while changing the context, but existing methods either struggle to do this zero-shot or require per-instance tuning at test time.
What's the solution?
The authors use a pre-trained text-to-image diffusion model to generate its own training data. They first exploit the model's in-context generation ability to produce grids of images showing the same subject in different settings, then curate these into a large paired dataset with the help of a vision-language model. Finally, they fine-tune the text-to-image model on this curated dataset so that it accepts both a text prompt and a reference image as conditioning, enabling customization without collecting new paired data or optimizing at test time.
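To make the data-generation step concrete, here is a minimal sketch in Python. It assumes an off-the-shelf diffusers text-to-image pipeline, a 2x2 grid prompt template, and a placeholder VLM filter (vlm_says_same_subject); the model name, prompt wording, and helper functions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the self-distillation data-generation idea (not the authors' code).
# Assumptions: an off-the-shelf diffusers text-to-image pipeline, a 2x2 grid prompt,
# and a placeholder VLM check that you would back with a real vision-language model.
import torch
from diffusers import AutoPipelineForText2Image
from PIL import Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

GRID_PROMPT = (
    "A 2x2 grid of four photos of the same {subject}, each panel in a "
    "different scene, with the subject's identity consistent across panels."
)

def split_grid(grid: Image.Image) -> list[Image.Image]:
    """Cut a 2x2 image grid into its four panels."""
    w, h = grid.size
    return [grid.crop((x, y, x + w // 2, y + h // 2))
            for y in (0, h // 2) for x in (0, w // 2)]

def vlm_says_same_subject(panels: list[Image.Image]) -> bool:
    """Placeholder for the VLM curation step: ask a vision-language model
    whether every panel shows the same instance. Replace with a real VLM call."""
    return True  # assumption: accept everything until a real VLM is plugged in

paired_data = []  # list of (reference image, target image) pairs
for subject in ["corgi wearing a tiny red scarf", "vintage teal motor scooter"]:
    grid = pipe(GRID_PROMPT.format(subject=subject)).images[0]
    panels = split_grid(grid)
    if vlm_says_same_subject(panels):
        # Any ordered pair of distinct panels becomes a (reference, target) example;
        # a text description of the target scene would be attached alongside.
        paired_data += [(ref, tgt) for ref in panels for tgt in panels if ref is not tgt]
```

The subjects, prompt template, and trivial filter above are stand-ins; the point is the loop structure: generate grids, split them into panels, filter with a VLM, and collect (reference, target, text) training examples.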
Why it matters?
This research is important because it allows artists and creators to generate high-quality, customized images more easily. By enabling models to condition on both a text prompt and an existing image, it opens up new possibilities for creative expression in fields like art, design, and marketing.
Abstract
Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preserving generation tasks, without requiring test-time optimization.
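To make the fine-tuning step concrete, the curated pairs can be plugged into a standard conditional denoising objective, where the reference image x^ref is an extra conditioning input alongside the text prompt c. The following is a generic sketch of such an objective in epsilon-prediction form, not necessarily the exact loss used in the paper (the base model may instead be trained with a flow-matching formulation), but the conditioning structure is the same:

```latex
% Generic conditional denoising objective; x^{ref} is the reference image,
% c the text prompt, x_0 the target image. A sketch of the conditioning
% structure, not necessarily the authors' exact loss.
\mathcal{L}(\theta) =
\mathbb{E}_{(x^{\mathrm{ref}},\, c,\, x_0),\; t,\; \epsilon \sim \mathcal{N}(0, I)}
\Big[ \big\| \epsilon - \epsilon_\theta\!\big(x_t,\, t,\, c,\, x^{\mathrm{ref}}\big) \big\|_2^2 \Big],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon .
```

At test time, the fine-tuned model simply takes a new reference image and prompt and samples directly, which is why no per-instance optimization is needed.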