
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Mark Endo, Serena Yeung-Levy

2025-11-24


Summary

This paper investigates what happens when large AI models that understand both images and text are made smaller and more efficient, focusing on how shrinking the text-understanding part affects their ability to 'see' and reason about images.

What's the problem?

Large AI models that handle images and text are very capable, but they are also huge and require a lot of computing power. Simply shrinking the overall model often causes a big drop in performance, especially in understanding the visual information in images. The researchers found that shrinking the 'brain' that processes text hurts the model's ability to understand what it is *seeing* more than it hurts the language skills it inherited. Notably, the damage was not limited to complex reasoning: even basic visual perception suffered.

What's the solution?

To fix this, the researchers developed a technique called 'visual extraction tuning', which trains the model to focus on and pull out the visual details from an image that are relevant to the task at hand. The model then uses these extracted details to reason through the problem step by step and arrive at an answer. They call this combined approach 'Extract+Think'.
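The two-stage pipeline described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the `call_model` function is a hypothetical stand-in for a small multimodal model (here it just returns canned strings), and the prompt wording is assumed for illustration.

```python
def call_model(prompt: str, image=None) -> str:
    """Hypothetical stand-in for a small multimodal model call.

    A real implementation would run vision-language inference here;
    this stub returns canned responses so the sketch is runnable.
    """
    if "List the visual details" in prompt:
        # Stage-1-style request: return extracted visual details.
        return "- a red ball on a wooden table\n- a window letting in daylight"
    # Stage-2-style request: return step-by-step reasoning plus an answer.
    return "The details say the red ball is on the table. Final answer: on the table."


def extract_then_think(image, question: str) -> dict:
    # Stage 1: visual extraction -- pull out only the details
    # relevant to the given instruction.
    extract_prompt = (
        f"Question: {question}\n"
        "List the visual details in the image relevant to this question."
    )
    details = call_model(extract_prompt, image=image)

    # Stage 2: step-by-step reasoning grounded in the extracted details,
    # rather than reasoning over the raw image directly.
    think_prompt = (
        f"Question: {question}\n"
        f"Relevant visual details:\n{details}\n"
        "Reason step by step, then state the final answer."
    )
    answer = call_model(think_prompt)
    return {"details": details, "answer": answer}


result = extract_then_think(image=None, question="Where is the red ball?")
```

The key design point is the separation of concerns: the first call is trained (via visual extraction tuning) to do perception only, so the second call's reasoning operates on a clean, task-relevant description instead of having to perceive and reason at once.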

Why it matters?

This work is important because it provides a way to build smaller, more efficient AI models that can still understand images and text effectively. This makes these powerful AI tools more practical for use in real-world applications where resources are limited, like on phones or in other devices without massive computing power.

Abstract

Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.