Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung

2025-07-04

Summary

This paper surveys how multimodal AI has evolved from using images as mere extra context to making vision an active part of the thinking process. It describes three main stages of this development and discusses the challenges that arise when different kinds of data, such as images and text, are combined during reasoning.

What's the problem?

The problem is that early AI systems treated images as static background context rather than actively using them to think and reason. This made it hard for those systems to solve complex problems that require a deep, joint understanding of visual and textual information.

What's the solution?

The researchers trace the progress of multimodal AI through three stages: a first stage in which vision is a passive input, a second in which vision begins to participate in understanding, and a third in which vision is fully integrated into the reasoning process itself. They also identify the main difficulties along the way, such as focusing on the important parts of an image during reasoning and managing step-by-step thought processes that interleave words and pictures; a minimal sketch of such an interleaved loop is shown below.
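To make the third stage concrete, here is a minimal, self-contained Python sketch of an interleaved text-and-image reasoning loop. It is an illustration only, not the survey's method or any specific system's API: `propose_step` and `crop` are hypothetical stubs standing in for a multimodal model call and a real image-cropping operation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Step:
    thought: str                                        # textual reasoning for this step
    region: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1) to inspect next
    answer: Optional[str] = None                        # set when the model commits to an answer

def propose_step(view: str, question: str, history: List[Step]) -> Step:
    """Hypothetical stub standing in for a multimodal model call."""
    if not history:
        return Step(thought="The label is too small to read; zoom in.",
                    region=(10, 10, 60, 40))
    return Step(thought="After zooming, the label reads '42'.", answer="42")

def crop(image: str, region: Tuple[int, int, int, int]) -> str:
    """Hypothetical stub for a visual action; a real system would crop pixel data."""
    return f"{image}[crop={region}]"

def think_with_images(image: str, question: str, max_steps: int = 5) -> List[Step]:
    """Alternate textual thoughts with visual actions until an answer emerges."""
    view, trace = image, []
    for _ in range(max_steps):
        step = propose_step(view, question, trace)
        trace.append(step)
        if step.answer is not None:       # the model produced a final answer
            break
        if step.region is not None:       # refocus on the chosen visual evidence
            view = crop(image, step.region)
    return trace

if __name__ == "__main__":
    for step in think_with_images("photo.png", "What number is on the label?"):
        print(step)
```

The key design point the sketch illustrates is that the visual action (the crop) happens inside the reasoning loop, so each later thought conditions on fresh visual evidence rather than on the original image alone.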

Why it matters?

This matters because stronger multimodal reasoning makes AI more capable and more human-like, able to solve problems by looking at pictures and reading text together. It opens new possibilities for AI in areas like education, healthcare, and any other field that needs a deep understanding of complex, real-world information.

Abstract

The survey outlines the evolution of multimodal AI from treating vision as static context to integrating it dynamically into the reasoning process, highlighting three stages and key challenges.