Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Zhiyu Lin, Yifei Gao, Xian Zhao, Yunfan Yang, Jitao Sang

2025-03-25

Summary

This paper explores how AI models can combine language understanding with visual information to reason and solve problems more like humans do.

What's the problem?

While AI models excel at language tasks, reaching more complex, human-like reasoning requires integrating visual information, much as humans combine sight and language to understand the world.

What's the solution?

The paper surveys and categorizes approaches to multimodal reasoning, in which AI models use both language and vision. It traces how these approaches have evolved over time and identifies the challenges they still face.

Why does it matter?

This work matters because it paves the way for AI models that can understand and interact with the world more effectively, leading to advancements in areas like robotics, image understanding, and human-computer interaction.

Abstract

Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state updates within the reasoning process, enabling a more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from the following two perspectives: (i) from visual-language reasoning to omnimodal reasoning and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advancements in multimodal reasoning research.