
Stable Consistency Tuning: Understanding and Improving Consistency Models

Fu-Yun Wang, Zhengyang Geng, Hongsheng Li

2024-10-25


Summary

This paper introduces Stable Consistency Tuning (SCT), a new method for improving the performance and stability of consistency models, a family of generative models that produce high-quality images far more quickly than standard diffusion models.

What's the problem?

While diffusion models can create high-quality images, they are slow because they must iteratively remove noise over many steps. Consistency models offer a much faster alternative, but they are harder to train well: current consistency training and tuning strategies have limitations, such as high-variance training signals, that make it difficult to achieve the best results.

What's the solution?

The authors propose a framework that models the denoising process of a diffusion model as a Markov Decision Process (MDP) and frames consistency model training as value estimation via Temporal Difference (TD) learning. Building on Easy Consistency Tuning (ECT), they introduce Stable Consistency Tuning (SCT), which incorporates variance-reduced learning based on the score identity. SCT yields significant improvements on benchmarks such as CIFAR-10 and ImageNet-64, setting a new state of the art for consistency models (on ImageNet-64, 1-step FID 2.42 and 2-step FID 1.55).
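
To make the TD framing concrete, below is a minimal, illustrative PyTorch sketch of a consistency objective written as a TD-style bootstrap: the prediction at a noisier timestep is regressed onto a stop-gradient target computed at a slightly cleaner timestep. The names (consistency_td_loss, model) and the VE-style noise perturbation are assumptions for illustration; this is not the paper's exact SCT loss, which additionally uses variance-reduced targets based on the score identity.

```python
import torch
import torch.nn.functional as F

def consistency_td_loss(model, x0, t, s, noise=None):
    """TD-style consistency objective (illustrative sketch, not the exact SCT loss).

    model(x, sigma) is assumed to map a noisy sample back to an estimate of x0.
    t > s are two noise levels per example; the prediction at the noisier level t
    is regressed onto a stop-gradient "bootstrap" target at the cleaner level s,
    mirroring value bootstrapping in TD learning.
    """
    if noise is None:
        noise = torch.randn_like(x0)

    # Noisy samples at the two adjacent noise levels (VE-style perturbation assumed).
    x_t = x0 + t.view(-1, 1, 1, 1) * noise
    x_s = x0 + s.view(-1, 1, 1, 1) * noise

    # Online prediction at the noisier point.
    pred = model(x_t, t)

    # Bootstrap target at the cleaner point; no gradient flows through it.
    with torch.no_grad():
        target = model(x_s, s)

    return F.mse_loss(pred, target)
```

The stop-gradient target here plays the same role as the bootstrapped value estimate in TD learning; SCT's variance-reduced variant would replace this single-sample target with one informed by the score identity, as described in the paper.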

Why it matters?

This research is important because it enhances the efficiency and effectiveness of image generation models. By improving how these models are trained, SCT can help create high-quality images faster, which is valuable for applications in art, design, and any field that relies on visual content creation.

Abstract

Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as the value estimation through Temporal Difference (TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.
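
As a rough way to see the TD analogy from the abstract, the consistency objective and the classic TD(0) value update can be written side by side. The rendering below is schematic and uses generic notation (a stop-gradient or EMA copy f_{theta^-} and a distance d), not the paper's exact objective.

```latex
% Schematic analogy between consistency training and TD(0) value bootstrapping.
% f_theta maps a noisy sample x_t at noise level t to an estimate of the clean data;
% f_{theta^-} is a stop-gradient (or EMA) copy of the model, d(.,.) a distance metric.
\[
\begin{aligned}
  \mathcal{L}(\theta)
    &= \mathbb{E}\,\Big[\, d\big( f_\theta(x_t, t),\; f_{\theta^-}(x_s, s) \big) \Big],
       \qquad s < t, \\
  \text{TD(0):}\qquad
  V(z_k) &\leftarrow V(z_k) + \alpha \big( r_k + \gamma\, V(z_{k+1}) - V(z_k) \big).
\end{aligned}
\]
```

The bootstrapped target f_{\theta^-}(x_s, s) plays the same role as the bootstrapped value r_k + \gamma V(z_{k+1}) in TD learning; SCT, as described above, reduces the variance of this learning signal using the score identity.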