Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
Nikita Starodubcev, Mikhail Khoroshikh, Artem Babenko, Dmitry Baranchuk
2024-06-21

Summary
This paper introduces a new method called Invertible Consistency Distillation (iCD), which improves text-to-image models by allowing them to effectively encode real images into their latent space, enabling better editing.
What's the problem?
Current models that convert text to images have made great progress, but they still struggle with certain tasks, especially inverting the generation process, that is, mapping a real image back into the model's internal representation. This means that while they can create images from text descriptions, they often can't accurately edit existing images, which limits their usefulness in practical applications.
What's the solution?
The researchers developed iCD, a framework that enhances distilled text-to-image models by enabling them to encode real images into the model's latent space, the representation the model samples from. This allows both high-quality image generation and accurate image manipulation in only 3-4 inference steps. They also employ dynamic guidance, which lowers the classifier-free guidance scale on some steps to reduce image reconstruction errors without noticeably degrading generation quality. Together, these make iCD an effective tool for editing images from text prompts without extensive computational resources.
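The dynamic guidance idea can be sketched in a few lines: instead of applying a fixed classifier-free guidance scale at every denoising step, the scale depends on the timestep, dropping to 1.0 (no guidance) on part of the trajectory so that inversion and reconstruction stay accurate. This is a minimal illustrative sketch; the threshold `tau`, scale `w_max`, and the exact schedule here are assumptions, not the paper's precise policy.

```python
def guidance_scale(t: float, w_max: float = 8.0, tau: float = 0.8) -> float:
    """Hypothetical dynamic-guidance schedule (illustrative values):
    apply full classifier-free guidance only on high-noise timesteps
    (t >= tau, with t in [0, 1]), and fall back to the unguided model
    (scale 1.0) elsewhere, which keeps reconstruction error low."""
    return w_max if t >= tau else 1.0


def guided_prediction(eps_cond: float, eps_uncond: float, t: float) -> float:
    """Standard classifier-free guidance combination, but with a
    per-step scale taken from the schedule above."""
    w = guidance_scale(t)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

With `w = 1.0` the guided prediction reduces exactly to the conditional prediction, so the steps where guidance is disabled behave like ordinary (easily invertible) denoising steps.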
Why it matters?
This research is important because it makes distilled diffusion models both efficient and capable of precise image manipulation. By improving the ability of fast models to edit images based on text, it can benefit applications such as graphic design, content creation, and virtual reality, making these technologies more accessible and responsive.
Abstract
Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.