Diffusion Feedback Helps CLIP See Better
Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
2024-07-30

Summary
This paper introduces DIVA, a new method that improves the visual understanding of CLIP, a model used to connect images and text. DIVA uses diffusion feedback, a generative signal from a text-to-image diffusion model, to sharpen how well CLIP interprets visual information.
What's the problem?
CLIP has significant visual limitations: it struggles to accurately perceive details such as orientation, quantity, color, and structure in images. These shortcomings arise because the image-text pairs used to train CLIP are often biased and lack sufficient diversity. As a result, they also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP.
What's the solution?
To address these issues, the authors developed DIVA, which stands for DIffusion model as a Visual Assistant for CLIP. The method uses generative feedback from text-to-image diffusion models to optimize CLIP's visual representations using only images, without any corresponding text. Through this self-supervised diffusion process, DIVA substantially improves CLIP's performance on benchmarks that test fine-grained visual abilities, such as MMVP-VLM, with gains of roughly 3-7%. It also helps MLLMs and vision models perform better on multimodal understanding and segmentation tasks.
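To make the mechanism more concrete, below is a minimal, conceptual sketch of image-only diffusion feedback in PyTorch. It is not the authors' implementation: the choice of Stable Diffusion v1.4 as the generative model, the linear projection from CLIP visual tokens to the UNet's cross-attention input, and the optimizer settings are all assumptions, and the paper's actual conditioning design may differ. The sketch freezes the VAE and UNet and backpropagates the standard denoising loss only into the CLIP image encoder, which is the core idea of using generative feedback to refine CLIP's representations.

```python
# Conceptual sketch of diffusion feedback for CLIP -- NOT the DIVA authors' code.
# Assumption: a frozen latent-diffusion UNet is conditioned on projected CLIP
# visual tokens, and the denoising loss is backpropagated into CLIP only.
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"

clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

# Hypothetical projection from CLIP's hidden size (1024) to the UNet's
# cross-attention dimension (768); the real conditioning scheme may differ.
proj = torch.nn.Linear(clip.config.hidden_size, unet.config.cross_attention_dim).to(device)

# Freeze the generative modules; only CLIP (and the projection) get gradients.
vae.requires_grad_(False)
unet.requires_grad_(False)
optimizer = torch.optim.AdamW(list(clip.parameters()) + list(proj.parameters()), lr=1e-6)

def diffusion_feedback_step(pixels_clip, pixels_vae):
    """One step: denoising loss conditioned on CLIP visual tokens.

    pixels_clip: images preprocessed for CLIP (224x224, CLIP normalization).
    pixels_vae:  the same images preprocessed for the VAE (512x512, in [-1, 1]).
    """
    # 1. CLIP visual tokens serve as the (image-only) condition.
    tokens = clip(pixel_values=pixels_clip).last_hidden_state      # (B, N, 1024)
    cond = proj(tokens)                                            # (B, N, 768)

    # 2. Standard latent-diffusion noise-prediction loss.
    with torch.no_grad():
        latents = vae.encode(pixels_vae).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample

    # 3. The denoising error acts as generative feedback on CLIP's representation.
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the condition pathway carries gradients, a lower denoising loss can only be achieved by making CLIP's visual tokens more informative about the image, which is the sense in which the diffusion model "assists" CLIP.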
Why it matters?
This research is important because it enhances the capabilities of AI systems that rely on visual understanding. By improving CLIP's performance, DIVA can lead to better results in applications that require interpreting images alongside text, making AI more effective in tasks like image classification and multimodal understanding. This could have a significant impact on how AI interacts with the world and processes information.
Abstract
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to a lack of distinctiveness in the text and diversity in the images. In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance (e.g., by 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code will be available at https://github.com/baaivision/DIVA.