Improved Training Technique for Latent Consistency Models

Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas

2025-02-04

Summary

This paper presents improved training techniques for latent consistency models, a family of generative models that can produce high-quality samples in just one or two steps. The authors analyze why consistency training, which works well in pixel space, degrades in latent space, and they introduce a set of fixes (a Cauchy loss, an early-timestep diffusion loss, optimal transport coupling, an adaptive scaling-c scheduler, and Non-scaling LayerNorm) that substantially narrow the gap with latent diffusion models.

What's the problem?

Consistency models can match diffusion models in pixel space, but scaling them to large text-to-image and video tasks requires training in the latent space of an autoencoder. The authors find that latent data has different statistics from pixel data: it often contains highly impulsive outliers, extreme values that dominate the training loss and significantly degrade the performance of standard consistency training (iCT) in latent space.

What's the solution?

The researchers replace the Pseudo-Huber loss used in iCT with a Cauchy loss, which is far more robust to the impulsive outliers found in latent data. They also add a diffusion loss at early timesteps, use optimal transport (OT) coupling to pair noise and data samples more effectively, introduce an adaptive scaling-c scheduler to manage the robust training process, and adopt Non-scaling LayerNorm so the network better captures feature statistics while reducing outlier impact. Together, these changes let them train latent consistency models that produce high-quality samples in one or two steps.
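To see why the loss swap helps, here is a minimal sketch (not the paper's code) comparing a standard Pseudo-Huber loss with a Cauchy loss on a single residual. The function names and the scalar setup are illustrative assumptions; only the general shapes of the two losses are taken from the paper:

```python
import math

def pseudo_huber(residual, c=1.0):
    # Pseudo-Huber: quadratic near zero, asymptotically linear,
    # so a huge residual still contributes a huge loss.
    return math.sqrt(residual ** 2 + c ** 2) - c

def cauchy(residual, c=1.0):
    # Cauchy: grows only logarithmically in the residual, so an
    # impulsive latent outlier is heavily down-weighted.
    return math.log(1.0 + (residual / c) ** 2)

# An impulsive outlier, like the extreme latent values the paper observes:
print(pseudo_huber(100.0))  # ~99.0, roughly linear in the residual
print(cauchy(100.0))        # ~9.2, only logarithmic
```

Because the Cauchy loss flattens out for large residuals, a handful of outlier latents can no longer dominate the gradient, which is the robustness property the paper exploits.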

Why it matters?

This research is important because one- or two-step sampling makes image generation far cheaper than the many denoising steps diffusion models require. By closing much of the quality gap between latent consistency models and latent diffusion models, these training techniques bring fast, high-quality text-to-image and video generation closer to practical, real-world use.

Abstract

Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-c scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: https://github.com/quandao10/sLCT/
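As a rough illustration of the OT coupling mentioned in the abstract, the sketch below pairs a mini-batch of noise samples with data samples so that the total squared distance is minimized. This is a hypothetical brute-force version for tiny 1-D batches (`ot_pair` is an invented name, and practical implementations would use an efficient assignment solver rather than enumerating permutations):

```python
import itertools

def ot_pair(data, noise):
    # Try every permutation of the noise batch and keep the one that
    # minimizes the total squared distance to the data batch.
    # Better-matched noise-data pairs give straighter trajectories
    # for the model to learn.
    best = min(itertools.permutations(noise),
               key=lambda perm: sum((d - n) ** 2 for d, n in zip(data, perm)))
    return list(zip(data, best))

# Toy 1-D example: each datum is matched to the nearest noise sample.
print(ot_pair([0.0, 1.0], [0.9, 0.1]))  # [(0.0, 0.1), (1.0, 0.9)]
```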