Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
2025-06-23
Summary
This paper introduces Mirage, a method that improves how AI models understand and reason over both images and text by drawing on hidden visual information while generating text.
What's the problem?
Many vision-language models struggle to combine visual and textual information effectively while reasoning, especially on tasks that call for visual thinking without actually producing images.
What's the solution?
The researchers introduced latent visual tokens: compact, hidden visual representations woven into the text generation process in place of rendered images. This lets the model use visual information internally to reason more effectively across multiple types of data.
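To make the idea concrete, here is a minimal toy sketch (not the paper's implementation; all names and the interleaving schedule are hypothetical) of a decoding loop that occasionally appends a latent "visual" vector to the model's context. The latent entries condition later text tokens but are never decoded into pixels and never appear in the output text.

```python
# Illustrative sketch only: interleave hidden "visual" vectors into a
# decoding loop so they influence later text tokens without ever being
# rendered as images. step_fn and latent_fn stand in for a real model.

def decode_with_latent_visual_tokens(prompt_tokens, step_fn, n_steps,
                                     latent_fn, latent_every=3):
    """Run n_steps of decoding; every `latent_every` steps, insert a
    latent visual token into the context instead of emitting text."""
    # Context holds (kind, value) pairs; prompt tokens are plain text.
    context = [("text", t) for t in prompt_tokens]
    output_text = []
    for step in range(n_steps):
        if step % latent_every == 0:
            # Hidden visual token: a vector appended to the context,
            # visible to the model but never shown to the user.
            context.append(("latent", latent_fn(context)))
        else:
            tok = step_fn(context)          # ordinary next-token step
            context.append(("text", tok))
            output_text.append(tok)         # only text reaches the output
    return output_text, context


if __name__ == "__main__":
    # Toy stand-ins: a dummy latent vector and a deterministic "model".
    latent_fn = lambda ctx: [0.0, 0.0, 0.0, 0.0]
    step_fn = lambda ctx: f"t{len(ctx)}"

    text, ctx = decode_with_latent_visual_tokens(["a", "b"], step_fn, 6, latent_fn)
    print(text)  # text tokens only; latent entries stay inside ctx
```

The key property the sketch mirrors is that the latent tokens live only in the model's working context, so the model pays the cost of "imagining" without the cost of image generation.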
Why it matters?
This matters because it makes AI better at complex tasks that involve both images and language, improving applications such as answering questions about pictures, visual storytelling, and other tasks that require strong multimodal reasoning.
Abstract
Mirage enhances vision-language models by integrating latent visual tokens into text decoding to improve multimodal reasoning without generating explicit images.