
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz

2024-11-11

Summary

This paper introduces RaVL, a new method for finding and fixing misleading connections in fine-tuned vision-language models (VLMs) that can cause errors in understanding images and text.

What's the problem?

Fine-tuned VLMs often make mistakes because they learn incorrect relationships between image features and textual attributes, known as spurious correlations. These misleading connections hurt the model's zero-shot performance, i.e., its accuracy when classifying images from classes it was not explicitly fine-tuned on. Current methods mainly operate on the whole image rather than on fine-grained image regions, which limits their ability to pinpoint and correct these errors.

What's the solution?

RaVL improves VLMs by focusing on local image features instead of the image as a whole. It first discovers spurious correlations with a region-level clustering approach that groups similar image regions and identifies which clusters contribute to zero-shot classification errors. Then, it fine-tunes the model with a region-aware loss function that encourages the VLM to attend to relevant regions and ignore the identified spurious ones. In this way, RaVL strengthens the model's focus on genuinely predictive features while reducing errors caused by spurious relationships.
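The two steps above (cluster regions, flag clusters linked to errors, then train with a region-aware objective) can be sketched at a high level. This is a minimal illustration, not the paper's implementation: the function names, the simple k-means clustering, the error-rate-based cluster score, and the toy loss that zeros out flagged regions are all assumptions made for clarity.

```python
import numpy as np

def cluster_regions(region_feats, k, iters=20, seed=0):
    # Step 1 (sketch): group region-level features with plain k-means,
    # standing in for the paper's region-level clustering.
    rng = np.random.default_rng(seed)
    centroids = region_feats[rng.choice(len(region_feats), k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(region_feats[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = region_feats[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels

def spurious_cluster_scores(labels, is_error, k):
    # Score each cluster by how over-represented it is among
    # misclassified examples; high scores flag candidate spurious features.
    scores = np.zeros(k)
    for c in range(k):
        mask = labels == c
        if mask.any():
            scores[c] = is_error[mask].mean() - is_error.mean()
    return scores

def region_aware_loss(region_sims, spurious_mask, labels_onehot):
    # Step 2 (toy version): aggregate image-text similarity over
    # non-flagged regions only, then apply a cross-entropy-style loss.
    # region_sims: (n_images, n_regions, n_classes)
    w = np.where(spurious_mask, 0.0, 1.0)                  # drop flagged regions
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-8)
    img_scores = (region_sims * w[..., None]).sum(axis=1)  # per-image class scores
    p = np.exp(img_scores) / np.exp(img_scores).sum(axis=1, keepdims=True)
    return -(labels_onehot * np.log(p + 1e-8)).sum(axis=1).mean()
```

The sketch keeps the key structural idea: discovery happens at the region level, and mitigation reweights the training signal so flagged regions no longer drive the image-text score.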

Why it matters?

This research is important because it improves the reliability of vision-language models, making them better at jointly understanding images and text. By addressing spurious correlations effectively, RaVL can lead to more accurate AI systems in real-world applications such as general-domain image classification and medical image analysis.

Abstract

Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.