
Latent Implicit Visual Reasoning

Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

2025-12-26

Summary

This paper focuses on improving how well Large Multimodal Models, AI systems that can understand both text and images, solve problems that require strong visual reasoning skills.

What's the problem?

Current Large Multimodal Models tend to rely heavily on text for reasoning, even when dealing with visual tasks. Attempts to help them focus on the important parts of an image usually supervise the model with extra visual signals that spell out *what* to look for, such as highlighted objects, image crops, or depth maps. This is problematic because producing those signals requires costly annotation, it restricts the model to the visual abstractions a human decided were useful, and it doesn't generalize well when the task is complex or new.

What's the solution?

The researchers developed a task-agnostic method that trains the model to discover and use its own visual reasoning 'tokens' *without* being explicitly told what to look for. These learned tokens attend over the whole image and re-encode it in a task-adaptive way, so the model automatically emphasizes the regions most relevant to the problem at hand. Because the tokens are learned from the task itself, the approach is flexible and doesn't require hand-crafted, task-specific supervision.
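
To make the idea concrete, here is a minimal sketch of this kind of latent-token re-encoding in PyTorch. It is an illustrative assumption, not the paper's actual implementation: the class, parameter names, and layer choices (LatentVisualReasoner, num_latent_tokens, a two-step cross-attention) are hypothetical, and only the downstream task loss would train the latent tokens.

```python
# Hypothetical sketch: learnable latent tokens attend globally over image
# features and produce a task-adaptive re-encoding fed to the language model.
import torch
import torch.nn as nn


class LatentVisualReasoner(nn.Module):
    def __init__(self, dim: int = 1024, num_latent_tokens: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable latent tokens; what they encode is discovered purely from
        # the downstream task loss (no intermediate visual supervision).
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, dim) * 0.02)
        # Latents attend globally over all image patch features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The attended latents then condition a re-encoding of the image tokens.
        self.re_encode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_patches, dim) from the vision encoder.
        b = image_tokens.size(0)
        latents = self.latent_tokens.unsqueeze(0).expand(b, -1, -1)
        # Step 1: latents gather task-relevant visual evidence globally.
        latents, _ = self.cross_attn(latents, image_tokens, image_tokens)
        # Step 2: image tokens are re-encoded conditioned on the latents,
        # emphasizing regions relevant to the current task.
        refined, _ = self.re_encode(image_tokens, latents, latents)
        return self.norm(image_tokens + refined)


# Usage: the refined features replace the original visual features given to
# the LMM; training uses only the standard next-token loss on the answer.
features = torch.randn(2, 196, 1024)        # e.g. 14x14 ViT patch features
refined = LatentVisualReasoner()(features)  # (2, 196, 1024)
```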

Why it matters?

This work is important because it makes Large Multimodal Models more capable of handling visual reasoning tasks, even ones that are difficult or haven't been seen before. By letting the model discover its own visual strategies, the approach reduces the need for manual labeling and improves the model's ability to generalize to new situations and different types of problems.

Abstract

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.