CleanDIFT: Diffusion Features without Noise

Nick Stracke, Stefan Andreas Baumann, Kolja Bauer, Frank Fundel, Björn Ommer

2024-12-05

Summary

This paper introduces CleanDIFT, a new method for extracting high-quality, noise-free features from pre-trained diffusion models. These features act as semantic descriptors that can be reused for a wide range of downstream vision tasks.

What's the problem?

Currently, to get useful features out of a diffusion model, an image first has to be made artificially noisy, because the model was trained on noisy inputs and yields weaker activations when given clean images. That added noise destroys image detail, so the extracted features are degraded and less effective for downstream tasks such as image recognition. Existing workarounds, such as ensembling features over several random noise draws, still depend on noise: they only partially recover the lost quality while multiplying the computational cost.
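
To make the standard pipeline concrete, here is a minimal sketch of DIFT-style noisy feature extraction, assuming a Stable-Diffusion-style setup loaded via Hugging Face diffusers. The objects `vae`, `unet`, `scheduler`, and `text_emb`, as well as the specific timestep and block index, are illustrative placeholders rather than values taken from the paper:

```python
import torch

# Placeholders, not code from the paper: `vae`, `unet`, `scheduler`, and
# `text_emb` stand in for a Stable-Diffusion-style backbone, e.g. loaded
# with Hugging Face diffusers.

features = {}

def hook(module, inputs, output):
    # Capture the intermediate activation used as the semantic descriptor.
    features["feat"] = output

@torch.no_grad()
def noisy_diffusion_features(image, t=261):
    """DIFT-style extraction: noise must be injected before the forward pass."""
    latents = vae.encode(image).latent_dist.mode() * vae.config.scaling_factor
    timesteps = torch.full((latents.shape[0],), t,
                           device=latents.device, dtype=torch.long)
    # This is the step CleanDIFT removes: corrupting the input with noise.
    noisy_latents = scheduler.add_noise(latents, torch.randn_like(latents), timesteps)
    handle = unet.up_blocks[1].register_forward_hook(hook)
    unet(noisy_latents, timesteps, encoder_hidden_states=text_emb)
    handle.remove()
    return features["feat"]
```

Because `noisy_latents` depends on a random noise draw, the resulting features change from run to run, which is exactly the instability the paper identifies.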

What's the solution?

CleanDIFT solves this problem with a lightweight, unsupervised fine-tuning step that lets a diffusion backbone produce high-quality features directly from clean images. Instead of feeding the model noisy inputs, CleanDIFT fine-tunes it so that its clean-image features match the useful representations the original model only exposes under noise. The resulting features significantly outperform previous diffusion features across a variety of tasks, with no noise added at inference time.
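
As a rough illustration of the idea, the sketch below fine-tunes a trainable copy of the backbone so that its clean-image features (always queried at t = 0) match the frozen original's noisy-image features. This is a simplification of our reading of the method, not the paper's exact recipe: the actual work aligns features across timesteps, while this sketch collapses that to a plain MSE loss. `unet`, `scheduler`, and `text_emb` are the same placeholders as above:

```python
import copy
import torch
import torch.nn.functional as F

# Assumed setup, carried over from the sketch above: `unet` is the frozen
# pre-trained backbone and `scheduler` its noise scheduler. All names are
# illustrative.
student = copy.deepcopy(unet).requires_grad_(True)   # trainable copy
teacher = unet.requires_grad_(False)                 # frozen original
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def extract_features(model, latents, timesteps, text_emb):
    # Hypothetical helper: run the model and grab one intermediate activation.
    captured = {}
    handle = model.up_blocks[1].register_forward_hook(
        lambda mod, args, out: captured.update(feat=out))
    model(latents, timesteps, encoder_hidden_states=text_emb)
    handle.remove()
    return captured["feat"]

def training_step(clean_latents, text_emb):
    b = clean_latents.shape[0]
    t = torch.randint(0, scheduler.config.num_train_timesteps, (b,),
                      device=clean_latents.device)
    noisy = scheduler.add_noise(clean_latents, torch.randn_like(clean_latents), t)

    with torch.no_grad():
        # Target: the useful features the frozen model only yields under noise.
        teacher_feat = extract_features(teacher, noisy, t, text_emb)

    # The student always sees clean latents at t = 0, so no noise is ever
    # needed at inference time.
    student_feat = extract_features(student, clean_latents,
                                    torch.zeros_like(t), text_emb)

    loss = F.mse_loss(student_feat, teacher_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After fine-tuning, a single call like `extract_features(student, clean_latents, torch.zeros_like(t), text_emb)` is the entire inference path: one forward pass over the clean image.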

Why it matters?

This research is important because it improves how AI systems extract useful information from images with diffusion models. By eliminating the need for noise in feature extraction, CleanDIFT makes these features both faster to compute and more accurate, leading to better results in computer vision tasks such as image classification and semantic correspondence.

Abstract

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.
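
To put the abstract's cost comparison in concrete terms, here is a hedged sketch of the noise-ensembling baseline, built on the `noisy_diffusion_features` placeholder defined earlier. The function name and draw count are illustrative assumptions:

```python
import torch

@torch.no_grad()
def ensembled_features(image, t=261, n_draws=8):
    # n_draws full forward passes, each with a fresh random noise sample.
    feats = [noisy_diffusion_features(image, t) for _ in range(n_draws)]
    return torch.stack(feats).mean(dim=0)

# A CleanDIFT-style fine-tuned model replaces this whole ensemble with a
# single pass over the clean image, which is where both the cost saving
# and, per the paper, the quality gain come from.
```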