Imagination Helps Visual Reasoning, But Not Yet in Latent Space

You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun

2026-02-27

Summary

This research investigates how well 'latent visual reasoning' – a technique where AI models use hidden internal steps to solve visual problems – actually works, and whether those hidden steps are truly necessary.

What's the problem?

Latent visual reasoning is a promising approach, but it's unclear *why* it seems to work. Researchers wanted to figure out if the hidden steps the AI takes (called 'latent tokens') are actually contributing to the correct answer, or if the AI is succeeding for some other reason. Essentially, they questioned if the AI is truly 'thinking' through these hidden steps or if it's just going through the motions.

What's the solution?

The researchers used a method called Causal Mediation Analysis to break the process into cause and effect. They treated the input image as the 'cause,' the latent tokens as the 'middle step' (the mediator), and the final answer as the 'effect.' They then measured how much changing the image affected the latent tokens, and how much changing the latent tokens affected the answer. The result was striking: even dramatic changes to the image barely moved the latent tokens, and changes to the latent tokens barely moved the answer. This suggested the latent tokens weren't actually doing much. Based on this, they created a new method called CapImagine, which has the model explicitly 'imagine' what's happening in the image using text descriptions instead of relying on hidden steps.
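The perturbation logic behind this analysis can be illustrated with a toy model. The sketch below is not the paper's actual setup (the names `W_in`, `W_out`, `latent`, and `answer` are stand-ins for a real multimodal model's components); it just shows the two measurements: perturb the input and see how much the mediator moves, then perturb the mediator and see how much the outcome moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a model with a hidden "latent" stage:
# input -> latent tokens -> answer. W_in and W_out are toy weights.
W_in = rng.normal(size=(8, 16))   # maps input features to latent tokens
W_out = rng.normal(size=(16, 4))  # maps latent tokens to answer logits

def latent(x):
    """Mediator: latent tokens computed from the input."""
    return np.tanh(x @ W_in)

def answer(z):
    """Outcome: answer logits computed from the latent tokens."""
    return z @ W_out

def effect_size(a, b):
    """Relative change between an original and a perturbed output."""
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + 1e-9)

x = rng.normal(size=8)

# (a) Input -> latent: perturb the input, measure how much the latent moves.
x_perturbed = x + rng.normal(scale=1.0, size=8)
input_latent_effect = effect_size(latent(x), latent(x_perturbed))

# (b) Latent -> answer: perturb the latent, measure how much the answer moves.
z = latent(x)
z_perturbed = z + rng.normal(scale=1.0, size=16)
latent_answer_effect = effect_size(answer(z), answer(z_perturbed))

print(f"input->latent effect:  {input_latent_effect:.3f}")
print(f"latent->answer effect: {latent_answer_effect:.3f}")
```

In this toy model both effect sizes come out clearly nonzero, because the chain is genuinely connected. The paper's finding was the opposite: in latent-reasoning models, both measurements were close to zero, which is what motivated dropping the latent steps.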

Why it matters?

This work challenges the idea that complex hidden steps are essential for visual reasoning in AI. By showing that a simpler approach – directly imagining using text – can actually perform *better*, it suggests that we might be able to build more efficient and understandable AI systems for solving visual problems. It points towards a future where AI doesn't need to rely on mysterious internal processes to 'think,' but can instead use clear, explicit reasoning.

Abstract

Latent visual reasoning aims to mimic humans' imagination process by mediating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating the limited causal effect latent tokens impose on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.