
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim

2024-10-13


Summary

This paper presents a new method called Fine-grained Selective Calibrated CLIP (FSC-CLIP) to improve how vision and language models (VLMs) understand and combine information from images and text without losing their ability to perform other tasks.

What's the problem?

Fine-tuning VLMs to reason about images and text together often improves their compositional reasoning but hurts their other abilities, such as zero-shot recognition and image-text retrieval. This happens because these methods rely on a global hard negative loss, which contrasts a single whole-image embedding against whole-caption embeddings. The hard negative captions differ from the original captions by only a small edit, so forcing the model to push them far away distorts its multi-modal representations and weakens overall performance.
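
To make the failure mode concrete, here is a minimal sketch of a global hard negative loss in a CLIP-style setup. It is an illustration rather than code from the paper; the function name, shapes, and the one-hard-negative-per-image setup are assumptions.

```python
# Rough sketch (not the paper's code) of a global hard negative (HN) loss:
# each image is contrasted against in-batch captions plus one HN caption,
# using only a single global embedding per text.
import torch
import torch.nn.functional as F

def global_hn_loss(img_emb, txt_emb, hn_txt_emb, temperature=0.07):
    """img_emb, txt_emb, hn_txt_emb: (B, D) L2-normalized global embeddings."""
    # Similarity of every image to every caption, and to its own HN caption.
    logits_pos = img_emb @ txt_emb.t() / temperature                            # (B, B)
    logits_hn = (img_emb * hn_txt_emb).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    logits = torch.cat([logits_pos, logits_hn], dim=1)                          # (B, B+1)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # The matched caption must beat both in-batch negatives and the HN caption,
    # even though the HN caption differs from the original by only a small edit.
    return F.cross_entropy(logits, targets)
```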

What's the solution?

To solve this problem, the authors developed FSC-CLIP, which combines two key innovations: local hard negative loss and selective calibrated regularization. The local hard negative loss contrasts hard negative captions against local, patch-level image representations rather than a single global embedding, providing fine-grained negative supervision without distorting the model's global representations. Selective calibrated regularization softens how strongly those hard negative captions are penalized, since they are nearly identical to the original captions, helping the model keep its broader understanding of images and text. In their experiments, FSC-CLIP matched state-of-the-art models on compositionality benchmarks while retaining strong performance across a wide range of multi-modal tasks.
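
For intuition, the sketch below shows simplified versions of the two ideas: a local score that aligns text tokens with image patches, and a softened (calibrated) target for the hard negative caption. The shapes, the max-over-patches pooling, and the smoothing rule are assumptions for illustration, not the paper's exact formulation; see the linked repository for the real implementation.

```python
# Loose sketch of the two ideas under simplified assumptions.
import torch
import torch.nn.functional as F

def local_score(patch_emb, token_emb):
    """patch_emb: (P, D) image patch embeddings; token_emb: (T, D) text token
    embeddings, all L2-normalized. Align each text token with its best-matching
    patch and average, giving a fine-grained image-text score instead of a
    single global one."""
    sim = token_emb @ patch_emb.t()        # (T, P) token-to-patch similarities
    return sim.max(dim=1).values.mean()    # best patch per token, averaged

def local_hn_loss_with_calibration(patch_emb, pos_tokens, hn_tokens,
                                   temperature=0.07, smooth=0.2):
    """Contrast the positive caption against one hard-negative caption using
    local scores; 'smooth' softens the target so the HN caption is not pushed
    away as aggressively as a random negative (a stand-in for selective
    calibrated regularization)."""
    logits = torch.stack([
        local_score(patch_emb, pos_tokens),
        local_score(patch_emb, hn_tokens),
    ]) / temperature                       # (2,)
    # Soft target: mostly the positive caption, a little mass left on the HN one.
    target = torch.tensor([1.0 - smooth, smooth], device=logits.device)
    return -(target * F.log_softmax(logits, dim=0)).sum()
```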

Why it matters?

This research is important because it offers a way to make AI models better at combining visual and textual information without sacrificing their performance in other areas. By improving how these models are fine-tuned, FSC-CLIP could benefit applications like image captioning, visual question answering, and other tasks that require a deep understanding of both images and text.

Abstract

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.