Discriminative Fine-tuning of LVLMs

Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Brais Martinez

2024-12-06

Summary

This paper introduces a new method for fine-tuning large vision-language models (LVLMs) so they can not only describe images and text but also discriminate between them more accurately, for example telling apart two very similar image-caption pairs.

What's the problem?

Contrastively-trained models like CLIP are good at matching images with text but have a limited understanding of language, often treating a caption as a "bag of words" rather than parsing how those words relate to each other. Meanwhile, LVLMs can reason about images and text in detail, but because they are built to generate text one token at a time, they are less effective at tasks that require precisely discriminating between similar items.

What's the solution?

The authors propose a training method that combines the strengths of contrastively-trained models and generative LVLMs. It fine-tunes an LVLM on image-text pairs of varying length and detail, using both a contrastive loss (which teaches the model to discriminate) and a next-token prediction loss (which preserves its language understanding), with a carefully designed optimization framework that balances the two. This effectively converts a generative LVLM into a discriminative one, so it performs better on tasks that require distinguishing between similar images or concepts.
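To make the idea of balancing two training losses concrete, here is a minimal NumPy sketch of how a symmetric contrastive (InfoNCE-style) loss could be combined with a next-token cross-entropy loss. The function names, the weighting scheme, and the temperature value are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs sit on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def cross_entropy_diag(l):
        # log-softmax per row, then take the diagonal (the matched pair)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -logp[idx, idx].mean()

    # average image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def next_token_loss(token_logits, targets):
    """Standard cross-entropy over next-token predictions (one row per token)."""
    l = token_logits - token_logits.max(axis=1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def combined_loss(img_emb, txt_emb, token_logits, targets, alpha=0.5):
    """Weighted sum of discriminative and generative objectives (alpha is a
    hypothetical balancing weight, not a value from the paper)."""
    return (alpha * contrastive_loss(img_emb, txt_emb)
            + (1 - alpha) * next_token_loss(token_logits, targets))
```

The key design point the paper emphasizes is that neither loss alone suffices: the contrastive term builds discrimination, while the next-token term keeps the LVLM's language understanding from degrading.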

Why it matters?

This research is important because it enhances the capabilities of AI models in understanding complex relationships between images and text. By improving how these models learn to discriminate between different inputs, this work could lead to advancements in various applications, such as image recognition, content generation, and even assistive technologies that rely on accurate visual understanding.

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
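The abstract's second contribution, parameter-efficient adaptation via soft prompting and LoRA adapters, can be sketched as follows. This is a generic NumPy illustration of the two standard techniques under assumed shapes and hyperparameters, not the paper's code:

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a trainable low-rank update B @ A.

    Only A and B (rank r, far fewer parameters than W) would be trained;
    rank and alpha here are illustrative defaults.
    """
    def __init__(self, weight, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = weight  # frozen base weight, shape (out_dim, in_dim)
        self.A = rng.normal(0.0, 0.01, size=(rank, weight.shape[1]))  # trainable
        self.B = np.zeros((weight.shape[0], rank))  # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank update. Because B starts at zero, the
        # adapted layer initially matches the frozen model exactly.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

def prepend_soft_prompt(token_embs, soft_prompt):
    """Prepend learnable prompt vectors (n_prompt, d) to each sequence
    of token embeddings (batch, seq_len, d)."""
    batch = token_embs.shape[0]
    prompts = np.broadcast_to(soft_prompt, (batch,) + soft_prompt.shape)
    return np.concatenate([prompts, token_embs], axis=1)
```

Both techniques leave the billions of base LVLM weights untouched: LoRA injects small trainable matrices into existing layers, while soft prompts add a handful of trainable embedding vectors to the input, which is what makes the fine-tuning parameter-efficient.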