Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
2025-06-23
Summary
This paper introduces Mirage, a method that improves how AI models understand and reason over both images and text by drawing on hidden visual information while generating text.
What's the problem?
Many vision-language models struggle to combine visual and textual information effectively while reasoning, especially on tasks that call for visual thinking without actually producing images.
What's the solution?
The researchers introduced latent visual tokens: compact, hidden visual representations woven into the text generation process in place of rendered images. This lets the model use visual information internally to reason more effectively across multiple types of data.
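To make the idea concrete, here is a minimal toy sketch (not the paper's implementation; all names and the interleaving schedule are hypothetical) of a decoding loop that occasionally appends a latent "visual" vector to the model's context. The latent entries condition later text tokens but are never decoded into pixels and never appear in the output text.

```python
# Illustrative sketch only: interleave hidden "visual" vectors into a
# decoding loop so they influence later text tokens without ever being
# rendered as images. step_fn and latent_fn stand in for a real model.

def decode_with_latent_visual_tokens(prompt_tokens, step_fn, n_steps,
                                     latent_fn, latent_every=3):
    """Run n_steps of decoding; every `latent_every` steps, insert a
    latent visual token into the context instead of emitting text."""
    # Context holds (kind, value) pairs; prompt tokens are plain text.
    context = [("text", t) for t in prompt_tokens]
    output_text = []
    for step in range(n_steps):
        if step % latent_every == 0:
            # Hidden visual token: a vector appended to the context,
            # visible to the model but never shown to the user.
            context.append(("latent", latent_fn(context)))
        else:
            tok = step_fn(context)          # ordinary next-token step
            context.append(("text", tok))
            output_text.append(tok)         # only text reaches the output
    return output_text, context


if __name__ == "__main__":
    # Toy stand-ins: a dummy latent vector and a deterministic "model".
    latent_fn = lambda ctx: [0.0, 0.0, 0.0, 0.0]
    step_fn = lambda ctx: f"t{len(ctx)}"

    text, ctx = decode_with_latent_visual_tokens(["a", "b"], step_fn, 6, latent_fn)
    print(text)  # text tokens only; latent entries stay inside ctx
```

The key property the sketch mirrors is that the latent tokens live only in the model's working context, so the model pays the cost of "imagining" without the cost of image generation.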
Why it matters?
This matters because it makes AI better at complex tasks that involve both images and language, improving applications such as answering questions about pictures, visual storytelling, and other tasks that require strong multimodal reasoning.
Abstract
Mirage enhances vision-language models by integrating latent visual tokens into text decoding to improve multimodal reasoning without generating explicit images.