
|↻BUS|: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S

2025-11-04


Summary

This paper focuses on the difficulty computers have with rebus puzzles, the visual word puzzles that use pictures and symbols to represent words or phrases. The researchers created a large collection of these puzzles and a new method that helps computers solve them more reliably.

What's the problem?

Current computer programs that can 'see' and understand language, called Vision-Language Models, struggle with rebus puzzles. These puzzles aren't just about recognizing images; they require understanding how images relate to words in a clever, sometimes indirect way, and involve things like common sense and figuring out multiple steps to get the answer. Existing methods weren't good enough at solving these kinds of puzzles.

What's the solution?

The researchers built a dataset of over 1,300 rebus puzzles spanning many categories, such as food and sports. More importantly, they developed a new technique called RebusDescProgICE. This technique first describes the puzzle in plain language, then reasons about the images and words in a structured, code-like way. It also selects the in-context examples the model learns from based on their reasoning, rather than picking them arbitrarily. Together, these changes significantly improved the performance of Vision-Language Models on rebus puzzles.
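To make the idea concrete, here is a minimal sketch of such a pipeline. The function names, the word-overlap similarity used for example selection, and the prompt layout are all illustrative assumptions, not the paper's actual implementation; the Vision-Language Model calls themselves are left out.

```python
# Illustrative sketch of a describe-then-reason pipeline with
# similarity-based in-context example selection. All names and the
# overlap heuristic are hypothetical stand-ins for the paper's method.

def select_in_context_examples(description, pool, k=2):
    """Pick the k pool puzzles whose descriptions share the most words
    with the new puzzle's description (a crude stand-in for the paper's
    reasoning-based selection)."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    ranked = sorted(pool,
                    key=lambda ex: overlap(description, ex["description"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(description, examples):
    """Combine the unstructured description of the new puzzle with
    code-like reasoning traces from the selected examples."""
    lines = []
    for ex in examples:
        lines.append(f"# Puzzle: {ex['description']}")
        lines.append(f"# Reasoning program: {ex['program']}")
        lines.append(f"# Answer: {ex['answer']}")
    lines.append(f"# Puzzle: {description}")
    lines.append("# Reasoning program:")
    return "\n".join(lines)

# Example usage with a tiny, made-up example pool:
pool = [
    {"description": "an arrow circling the word BUS inside bars",
     "program": "combine(circle_arrow='re', word='BUS')",
     "answer": "rebus"},
    {"description": "a picture of a cat sitting on a mat",
     "program": "combine(animal='cat', object='mat')",
     "answer": "cat on a mat"},
]
chosen = select_in_context_examples(
    "circling arrow drawn next to the letters BUS", pool, k=1)
prompt = build_prompt("circling arrow drawn next to the letters BUS", chosen)
```

The resulting prompt, with its structured example traces, would then be sent to the Vision-Language Model, which completes the final "Reasoning program" line and produces the answer.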

Why it matters?

This work is important because it pushes the boundaries of what computers can understand. Solving rebus puzzles requires a level of reasoning and visual understanding that's closer to how humans think. By improving computers' ability to tackle these puzzles, we're moving closer to creating AI that can truly understand and interact with the world in a more intelligent way.

Abstract

Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, image-based wordplay, etc., making this a challenging task for even current Vision-Language Models. In this paper, we present |↻BUS|, a large and diverse benchmark of 1,333 English Rebus Puzzles containing different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose RebusDescProgICE, a model-agnostic framework which uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection, improving the performance of Vision-Language Models on |↻BUS| by 2.1-4.1% and 20-30% using closed-source and open-source models respectively compared to Chain-of-Thought Reasoning.