Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper

2025-01-28

Summary

This paper looks at how Vision Language Models (VLMs) process visual information differently from traditional computer vision models. It explores whether VLMs focus more on the texture or the shape of objects in images, and whether this focus can be influenced through language prompts.

What's the problem?

Computer vision models have typically focused more on texture than on shape when identifying objects, which is different from how humans see things. Since VLMs combine visual and language processing, researchers wanted to know whether these newer models behave more like humans by focusing on shape, or whether they simply inherit the texture bias from traditional vision models.

What's the solution?

The researchers studied a wide range of popular VLMs to see how they process visual information. They found that VLMs tend to focus more on shape than their vision-only counterparts, suggesting that adding language processing influences how these models 'see' images. They also experimented with different language prompts to see if they could steer the models to focus even more on shape. Through these experiments, they were able to steer the models' shape bias from as low as 49% to as high as 72% just by changing the text prompts.
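
To make the 49% and 72% figures concrete: shape bias is typically measured on cue-conflict images (for example, a cat silhouette filled with elephant-skin texture) as the fraction of shape decisions among all answers that match either cue. The sketch below is a minimal illustration of that metric, assuming the VLM's answers have already been collected; it is not the paper's evaluation code.

```python
from typing import List, Tuple

def shape_bias(predictions: List[Tuple[str, str, str]]) -> float:
    """Shape bias on cue-conflict images.

    Each item is (shape_label, texture_label, model_answer), e.g.
    ("cat", "elephant", "cat") for a cat silhouette with elephant texture.
    Answers that match neither cue are ignored, following the usual
    cue-conflict protocol.
    """
    shape_hits = sum(1 for s, _, p in predictions if p == s)
    texture_hits = sum(1 for _, t, p in predictions if p == t)
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Toy example: 3 shape decisions vs. 1 texture decision -> shape bias 0.75
answers = [
    ("cat", "elephant", "cat"),
    ("car", "clock", "car"),
    ("dog", "bottle", "dog"),
    ("bear", "knife", "knife"),
]
print(f"shape bias: {shape_bias(answers):.2f}")
```

In this setup, a model that always answers with the texture label scores 0% and one that always answers with the shape label scores 100%; the paper's prompting experiments move VLMs within roughly the 49% to 72% part of that range.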

Why it matters?

This research matters because it helps us understand how AI 'sees' the world and how close it is to human perception. If we can make AI see more like humans do, it could lead to better and more intuitive AI systems for tasks like image recognition, visual question answering, and even robotics. The ability to steer these models using language is particularly exciting, as it suggests we might be able to make AI adapt its visual processing on the fly for different tasks, just by giving it different instructions. While the models still don't match human levels of shape focus, this research opens up new possibilities for improving AI vision systems.

Abstract

Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, over image captioning, to visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
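
As a rough sketch of the prompt-steering idea from the abstract, the loop below asks the same cue-conflict images with differently worded prompts and compares the resulting shape bias. The prompt wordings and the `vlm_classify` stub are illustrative assumptions, not the paper's exact prompts or evaluation code.

```python
import random
from typing import Dict, Tuple

# Illustrative prompt templates (assumed wording, not the paper's exact prompts):
PROMPTS = {
    "neutral": "Which object is shown in the image? Answer with a single word.",
    "shape": "Focus on the overall shape and ignore the texture. Which object is shown? Answer with a single word.",
    "texture": "Focus on the texture and ignore the shape. Which object is shown? Answer with a single word.",
}

# Toy cue-conflict set: image file -> (shape label, texture label).
CUE_LABELS: Dict[str, Tuple[str, str]] = {
    "cat_vs_elephant.png": ("cat", "elephant"),
    "car_vs_clock.png": ("car", "clock"),
    "dog_vs_bottle.png": ("dog", "bottle"),
}

def vlm_classify(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for a real VLM call; it guesses randomly between
    the two cues so the script runs without any model."""
    shape_label, texture_label = CUE_LABELS[image_path]
    return random.choice([shape_label, texture_label])

for name, prompt in PROMPTS.items():
    shape_hits = texture_hits = 0
    for image, (shape_label, texture_label) in CUE_LABELS.items():
        answer = vlm_classify(image, prompt)
        shape_hits += answer == shape_label
        texture_hits += answer == texture_label
    decided = shape_hits + texture_hits
    bias = shape_hits / decided if decided else float("nan")
    print(f"{name:8s} prompt -> shape bias {bias:.2f}")
```

With a real VLM in place of the stub, the paper reports that this kind of prompt rewording alone can shift shape bias from as low as 49% to as high as 72%, still well below the roughly 96% shape bias of humans.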