
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Trung Dao, Thuan Hoang Nguyen, Thanh Le, Duc Vu, Khoi Nguyen, Cuong Pham, Anh Tran

2024-08-27

Summary

This paper presents SwiftBrush v2, an improved version of a one-step text-to-image diffusion model that aims to match the image quality of slower multi-step models like Stable Diffusion while keeping generation fast and simple.

What's the problem?

The original SwiftBrush was strong at generating diverse images, but its image quality fell short of multi-step models like Stable Diffusion (and of the distilled SD Turbo, which trades diversity for quality). This quality gap made it hard for SwiftBrush to compete on realism.

What's the solution?

To close this gap, the authors revised how the model is trained. They started training from a better weight initialization and used LoRA (Low-Rank Adaptation) to make training more efficient. They also introduced a clamped CLIP loss, a new objective that improves how well generated images match their text descriptions. Finally, by merging the weights of a model trained with efficient LoRA and one trained in full, they produced a version of SwiftBrush that generates higher-quality images while remaining one-step fast. A sketch of what the clamped loss could look like follows below.
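To make the alignment idea concrete, here is a minimal sketch of a clamped CLIP-style loss in PyTorch. The hinge form and the threshold value `tau` are assumptions for illustration, not the paper's exact formulation: the idea is simply that the loss stops pushing once an image-text pair is already well aligned.

```python
import torch
import torch.nn.functional as F

def clamped_clip_loss(image_embeds: torch.Tensor,
                      text_embeds: torch.Tensor,
                      tau: float = 0.35) -> torch.Tensor:
    """Hinge-style clamped CLIP alignment loss (illustrative sketch).

    Pushes each image embedding toward its paired text embedding,
    but contributes zero loss (and zero gradient) once the cosine
    similarity already exceeds the assumed threshold `tau`.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    sim = (image_embeds * text_embeds).sum(dim=-1)  # per-pair cosine similarity
    # Clamp: no penalty for pairs that are already well aligned.
    return torch.clamp(tau - sim, min=0.0).mean()
```

One plausible motivation for the clamp is that gradients vanish for already-aligned pairs, so the CLIP term cannot keep dominating training and over-optimize the generator toward the CLIP score at the expense of image quality.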

Why it matters?

This research is important because it shows that a simpler, faster model can be enhanced to match, and even surpass, more complex ones without the computational cost of multi-step generation. That makes advanced image generation technology more accessible for applications such as art creation, advertising, and virtual reality.

Abstract

In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training. Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model with an FID of 8.14, surpassing all GAN-based and multi-step Stable Diffusion models. The evaluation code is available at: https://github.com/vinairesearch/swiftbrushv2.
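The abstract's weight-combination step can also be sketched simply. Below is a hypothetical linear merge of two checkpoints; the helper name, the assumption that the LoRA deltas are already folded into the base weights, and the mixing ratio `alpha` are all illustrative choices, not the paper's actual recipe.

```python
import torch

def merge_checkpoints(full_sd: dict, lora_sd: dict, alpha: float = 0.5) -> dict:
    """Linear interpolation of two checkpoints (illustrative sketch).

    `full_sd` is the state dict of the fully trained model; `lora_sd`
    is the state dict of the LoRA-trained model with its low-rank
    deltas already merged into the base weights. `alpha` is an assumed
    mixing ratio.
    """
    return {name: alpha * full_sd[name] + (1.0 - alpha) * lora_sd[name]
            for name in full_sd}
```

In use, one would load both checkpoints with torch.load and pass their state dicts to this helper. One plausible reading of why such merging helps is that averaging compatible weights lets the merged model inherit strengths from both training runs, here the quality of one and the diversity of the other.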