Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang

2025-12-19

Summary

This paper introduces a new way for AI models that understand both images and text to think through problems, making them more accurate and efficient.

What's the problem?

Current AI models that combine visual and language understanding often follow a rigid, step-by-step reasoning process, akin to writing out every single thought. This makes them unstable, since small perturbations can throw the reasoning off course, and producing all those explicit steps takes a lot of computing power. It is also unlike how humans think: rather than handling perception and reasoning in separate, fixed stages, we constantly go back and forth between seeing and reasoning.

What's the solution?

The researchers developed a framework called DMLR that lets the AI dynamically interweave reasoning and visual perception at test time. The model refines its internal 'thoughts' (latent think tokens) based on how confident it is in them: low confidence triggers further refinement. Simultaneously, at each thinking step it retrieves the image patches most relevant to the current thought and injects them into the latent token, constantly updating which visual details matter. This creates a more fluid and efficient process that mimics how our minds interleave seeing and thinking.
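To make the idea concrete, here is a minimal NumPy sketch of the two loops described above: a confidence check on a latent think token, retrieval of the most similar visual patches, and a small update that mixes a gradient step (toward higher confidence) with the retrieved visual features. All names (`refine_latent`, `top_k_patches`, the dimensions, learning rate, and threshold) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def top_k_patches(latent, patches, k=3):
    # Retrieve the k visual patches most similar to the latent think token
    # (cosine similarity) -- a stand-in for the dynamic visual retrieval step.
    sims = patches @ latent / (
        np.linalg.norm(patches, axis=1) * np.linalg.norm(latent) + 1e-8
    )
    idx = np.argsort(sims)[-k:]
    return idx, sims[idx]

def refine_latent(latent, vocab_proj, patches,
                  steps=5, lr=0.1, conf_thresh=0.9):
    # Confidence-guided refinement: keep nudging the latent think token
    # until the decoder head is confident, injecting top-k patches each step.
    conf = softmax(vocab_proj @ latent).max()
    for _ in range(steps):
        if conf >= conf_thresh:
            break
        idx, sims = top_k_patches(latent, patches)
        weights = softmax(sims)
        visual = weights @ patches[idx]        # similarity-weighted patch mix
        probs = softmax(vocab_proj @ latent)
        target = probs.argmax()
        # Gradient of log p(target) w.r.t. the latent under a linear head:
        # d/dz log softmax(Wz)_t = W_t - sum_j p_j W_j
        grad = vocab_proj[target] - probs @ vocab_proj
        latent = latent + lr * grad + lr * visual
        conf = softmax(vocab_proj @ latent).max()
    return latent, conf
```

The visual injection here is a simple additive mix; the point is the control flow: confidence decides whether to keep thinking, and each extra thinking step re-selects which patches to attend to.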

Why it matters?

This research is important because it improves the ability of AI to solve complex problems that require both visual understanding and reasoning, like answering questions about images or videos. Importantly, it does so without significantly increasing the computational cost, making it more practical for real-world applications. It represents a step towards more human-like AI reasoning.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, suffer from unstable perception-reasoning interaction, and incur notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into the latent think tokens to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.