Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long

2026-01-28

Summary

This research explores how AI can reason more like humans by building and using 'world models' – internal representations of how things work. It focuses on whether combining visual information with language processing, instead of just using language, improves AI's reasoning abilities, especially in situations involving the physical world.

What's the problem?

Current AI systems, like large language models, are really good at tasks that rely on words and logic, such as math or coding. However, they struggle with tasks that require understanding the physical world and spatial relationships, which humans find easy. This is likely because they rely mostly on verbal reasoning and lack the richer, more intuitive understanding that comes from also processing visual information. The open question is whether generating images as part of the reasoning process actually *helps* AI reason better, and if so, *when*.

What's the solution?

The researchers propose what they call the visual superiority hypothesis: visual generation is a more natural way to model the physical world, while verbal reasoning is sufficient for abstract tasks. To test it, they built a new evaluation suite, called VisWorld-Eval, made up of tasks that call for interleaved visual and verbal reasoning. They then ran controlled experiments on a state-of-the-art unified multimodal model that can generate both images and text, comparing its performance when it reasoned using only words versus when it interleaved words with generated images as part of its thought process (a sketch of this kind of comparison follows below). The interleaved approach significantly improved performance on tasks where visual world modeling was crucial, but offered no clear advantage on the others.
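To make the setup concrete, here is a minimal sketch of what such a controlled comparison could look like in code. It is illustrative only: the Task fields, the generate_verbal_cot and generate_interleaved_cot methods assumed on the model object, and the exact-match scoring are placeholder assumptions, not the paper's actual evaluation code or the VisWorld-Eval format.

```python
# Illustrative sketch of the controlled comparison (not the paper's code).
# The model object is assumed to expose two hypothetical methods:
#   generate_verbal_cot(prompt)      -> final answer after text-only reasoning
#   generate_interleaved_cot(prompt) -> final answer after reasoning that
#                                       interleaves generated images with text

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    prompt: str         # task statement (may reference an input image)
    answer: str         # ground-truth answer used for scoring
    visual_world: bool  # True if the task is expected to favor visual world modeling


def accuracy(solve: Callable[[Task], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose final answer matches the ground truth exactly."""
    if not tasks:
        return 0.0
    correct = sum(solve(t).strip() == t.answer.strip() for t in tasks)
    return correct / len(tasks)


def compare_cot_modes(model, tasks: List[Task]) -> dict:
    """Score the same model on the same tasks under the two reasoning modes."""
    return {
        "verbal_only_cot": accuracy(lambda t: model.generate_verbal_cot(t.prompt), tasks),
        "interleaved_cot": accuracy(lambda t: model.generate_interleaved_cot(t.prompt), tasks),
    }
```

The point of the setup is that the tasks and the model stay fixed and only the reasoning mode changes, so any accuracy gap can be attributed to interleaving generated images into the chain of thought.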

Why it matters?

This work shows that incorporating visual information into AI reasoning can make it more powerful and human-like, particularly when dealing with real-world scenarios. It clarifies *how* and *when* visual processing is beneficial, paving the way for developing AI systems that can better understand and interact with the physical world around them, moving beyond just processing text.

Abstract

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks, particularly those grounded in the physical world, visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.