Monet: Reasoning in Latent Visual Space Beyond Images and Language
Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
2025-11-27
Summary
This paper introduces Monet, a training framework that gets AI to 'think' with images, making it better at visual reasoning tasks. Rather than just *seeing* images, the model learns to process and manipulate visual information in a more abstract, human-like way while solving problems.
What's the problem?
Current AI models that try to reason with images lean on external tools to analyze visuals, which limits their flexibility and keeps them short of the kind of abstract visual thinking humans do. Training a model to reason with its own internal visual representations instead is hard for two reasons: aligning those internal representations with the model's visual understanding is computationally expensive, and there is little direct supervision telling the model *how* it should represent visual ideas internally. On top of that, GRPO, a common reinforcement learning technique for improving reasoning, mostly ends up strengthening the text side of the reasoning rather than these internal visual representations.
What's the solution?
The researchers developed a three-stage training pipeline, 'distillation-based supervised fine-tuning', that teaches the model to produce its own internal visual 'thoughts': continuous numerical representations that act as intermediate reasoning steps. They also created VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly folds these internal visual representations into the policy updates, so they get improved during training rather than just the text. To support the supervised stage, they built Monet-SFT-125K, a dataset of 125,000 high-quality examples of step-by-step reasoning that interleaves text with images, charts, OCR content, and geometry problems. The resulting model is called Monet-7B.
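To make the idea more concrete, below is a minimal, self-contained PyTorch sketch of latent visual reasoning with a VLPO-style update. It is not the paper's implementation: the tiny recurrent model, the gating between text and latent steps, the Gaussian treatment of latent embeddings, and every name in the code (`ToyLatentReasoner`, `latent_head`, `vlpo_style_loss`, and so on) are illustrative assumptions. The point is just to show the two moving parts described above: a model emitting continuous 'visual thought' embeddings in the middle of a reasoning rollout, and a policy-gradient loss that gives those embeddings a learning signal instead of only the text tokens.

```python
# Conceptual sketch only -- not the Monet codebase. All names are hypothetical.
import torch
import torch.nn as nn


class ToyLatentReasoner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRUCell(d_model, d_model)        # stand-in for the MLLM backbone
        self.text_head = nn.Linear(d_model, vocab_size)  # next-token logits
        self.latent_head = nn.Linear(d_model, d_model)   # continuous "visual thought"
        self.gate = nn.Linear(d_model, 2)                # choose: text step vs. latent step

    def step(self, x, h):
        h = self.rnn(x, h)
        return h, self.text_head(h), self.latent_head(h), self.gate(h)


def rollout(model, prompt_ids, max_steps=8):
    """Roll out a reasoning trace that interleaves text tokens with latent embeddings."""
    h = torch.zeros(1, model.d_model)
    x = model.embed(prompt_ids[:, -1])  # start from the last prompt token
    text_logps, latent_means = [], []
    for _ in range(max_steps):
        h, text_logits, latent_mu, gate_logits = model.step(x, h)
        if gate_logits.argmax(-1).item() == 1:
            # Latent step: the continuous embedding is the "visual thought" and is
            # fed straight back in as the next input, bypassing the vocabulary.
            latent_means.append(latent_mu)
            x = latent_mu
        else:
            # Text step: sample a discrete token as usual and record its log-prob.
            dist = torch.distributions.Categorical(logits=text_logits)
            tok = dist.sample()
            text_logps.append(dist.log_prob(tok))
            x = model.embed(tok)
    return text_logps, latent_means


def vlpo_style_loss(text_logps, latent_means, advantage, sigma=1.0):
    """Toy VLPO-flavoured objective: advantage-weighted log-probs for text tokens,
    plus a Gaussian log-density term so latent embeddings also receive a
    policy-gradient signal (one plausible reading of 'incorporating latent
    embeddings into policy gradient updates'; not the paper's exact math).
    `advantage` would come from GRPO-style group-normalized rewards."""
    loss = torch.zeros(())
    if text_logps:
        loss = loss - advantage * torch.stack(text_logps).sum()
    for mu in latent_means:
        latent_dist = torch.distributions.Normal(mu, sigma)
        z = latent_dist.rsample()  # treat the latent step as a continuous action
        loss = loss - advantage * latent_dist.log_prob(z).sum()
    return loss


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyLatentReasoner()
    prompt = torch.randint(0, 1000, (1, 4))
    text_logps, latent_means = rollout(model, prompt)
    loss = vlpo_style_loss(text_logps, latent_means, advantage=torch.tensor(0.5))
    loss.backward()
    print(f"text steps: {len(text_logps)}, latent steps: {len(latent_means)}, "
          f"loss: {loss.item():.4f}")
```

The Gaussian log-probability term is just one plausible way to give continuous embeddings a policy-gradient signal; the actual VLPO formulation, and the three-stage distillation pipeline that precedes it, are described in the paper itself.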
Why it matters?
This work is important because it moves AI closer to being able to truly understand and reason about the visual world, not just recognize objects. By allowing the AI to internally represent and manipulate visual information, it can tackle more complex and abstract visual problems, and potentially generalize better to new situations. This could have applications in areas like robotics, image analysis, and even helping people understand complex visual data.
Abstract
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.