Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan

2025-05-30

Summary

This paper shows that AI models that handle both images and text well still struggle to solve rebus puzzles, brain teasers that use pictures and letters to represent words or phrases in creative ways.

What's the problem?

Although these vision-language models can recognize objects in images and read text, they struggle with the abstract thinking needed to pick up on hints and visual metaphors, which is exactly what tricky puzzles like rebuses require.

What's the solution?

The researchers tested these AI models on a set of rebus puzzles and found that, unlike humans, the models often missed the deeper meaning or clever connections between the pictures and words, revealing a clear weakness in their reasoning abilities.

Why does it matter?

This matters because it shows that while AI can handle straightforward visual tasks, it still has a long way to go before it can think creatively or understand things the way people do, especially on puzzles or problems that require imagination.

Abstract

Vision-language models struggle with rebus puzzles, which require abstract reasoning and understanding of visual metaphors, despite performing well on simple visual cues.
