When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye

2025-11-05

Summary

This paper introduces MIRA, a new benchmark that tests how well AI models can reason when they need to 'think visually', that is, when they have to imagine or create images as part of solving a problem.

What's the problem?

Current AI models are very good at processing text and reasoning step by step (so-called 'Chain of Thought' reasoning). However, many real-world problems can't be fully captured in words alone: they involve spatial relationships, structures, or steps that are easier to visualize than to describe. Existing benchmarks didn't adequately challenge AI to think visually the way humans do when we sketch things out to help us solve a problem.

What's the solution?

The researchers created MIRA, a collection of 546 problems that *require* visual reasoning. These problems aren't just about looking at an image and answering a question; the AI needs to essentially 'draw to think' by generating or using intermediate images (like diagrams or sketches) to get to the final answer. They tested different ways of giving the AI information – just the image and question, text prompts to help it think, or both the prompts *and* visual clues. They also tested how well models performed when given multiple attempts to get the right answer.
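The paper's pass@k and majority-voting metrics are standard ways to probe a model's upper bound when it gets multiple sampled attempts. As a rough sketch (not the authors' code), pass@k is typically computed with the unbiased estimator from Chen et al. (2021), and majority voting simply checks whether the most frequent sampled answer matches the gold answer:

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k attempts, drawn without replacement from n
    total samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote_correct(answers: list[str], gold: str) -> bool:
    """True if the most frequent sampled answer equals the gold answer."""
    winner, _ = Counter(answers).most_common(1)[0]
    return winner == gold

# Hypothetical example: 8 sampled answers to one problem, 3 correct ("B").
samples = ["B", "A", "B", "C", "B", "A", "D", "C"]
print(round(pass_at_k(n=8, c=3, k=4), 3))      # → 0.929
print(majority_vote_correct(samples, gold="B"))  # → True
```

Aggregating either metric over all 546 problems gives the benchmark-level accuracies reported under different k settings.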

Why it matters?

The results showed that AI models struggle with these visually demanding problems when relying only on text. However, providing intermediate visual cues consistently improved performance, yielding an average relative gain of 33.7% across models and tasks. This highlights that visual reasoning is a crucial skill for AI and that simply making models better at processing text isn't enough to solve complex problems. It suggests that future AI development needs to focus on incorporating the ability to 'see' and 'visualize' information.

Abstract

We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.