Vision language models are blind
Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
2024-07-10

Summary
This paper examines the limitations of vision-language models (VLMs), such as GPT-4o and Gemini 1.5 Pro, on simple visual tasks that humans handle easily. Despite their high scores on many benchmarks, these models struggle with basic visual understanding.
What's the problem?
The main problem is that even though VLMs are designed to understand both images and text, they fail at very simple visual tasks. For example, they have difficulty determining whether two circles overlap or two lines intersect, identifying which letter is circled in a word, and counting the circles in an Olympic-like logo. This suggests that their ability to 'see' and interpret visual information is much weaker than their benchmark scores imply.
What's the solution?
To highlight these issues, the authors tested four state-of-the-art VLMs on seven basic visual tasks that are easy for humans. The models performed poorly, indicating that they do not perceive visual detail as reliably as their benchmark scores suggest. The authors liken the models' vision, at best, to that of a person with myopia who sees fine details as blurry and, at worst, to that of an intelligent blind person making educated guesses.
Why does it matter?
This research is important because it raises awareness about the limitations of current VLMs in understanding visual information. By identifying these weaknesses, it encourages further improvement in AI systems so that they can better interpret images like humans do. This is crucial for applications relying on accurate visual understanding, such as image recognition software and interactive AI systems.
Abstract
Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks that are absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) how many circles are in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry, and at worst, like that of an intelligent person who is blind making educated guesses. Code is available at: https://vlmsareblind.github.io/
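
To make concrete how simple the first two tasks are in geometric terms, here is a minimal Python sketch of how a ground-truth label could be computed when the shapes' coordinates are known. This is not the authors' code (see the link above for that). The function names, the choice to treat lines as finite segments, and the convention that touching shapes count as overlapping or intersecting are all assumptions made for illustration.

```python
import math


def circles_overlap(c1, r1, c2, r2):
    """Return True if two discs share at least one point.

    Assumption: touching circles and one circle inside another both
    count as overlapping. Requires Python 3.8+ for math.dist.
    """
    return math.dist(c1, c2) <= r1 + r2


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _within_box(p, q, r):
    """For collinear p, q, r: True if r lies within the bounding box of segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """Return True if segment ab and segment cd share at least one point."""
    o1, o2 = _orient(a, b, c), _orient(a, b, d)
    o3, o4 = _orient(c, d, a), _orient(c, d, b)
    # General case: each segment's endpoints lie on opposite sides of the other.
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other segment.
    return ((o1 == 0 and _within_box(a, b, c))
            or (o2 == 0 and _within_box(a, b, d))
            or (o3 == 0 and _within_box(c, d, a))
            or (o4 == 0 and _within_box(c, d, b)))


if __name__ == "__main__":
    print(circles_overlap((0, 0), 1, (1.5, 0), 1))
    print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))
```

Each check is a handful of arithmetic operations on known coordinates, which is part of why failures on the image versions of these tasks are notable: the difficulty lies in perceiving the geometry from pixels, not in the reasoning once the geometry is known.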