RL makes MLLMs see better than SFT
Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo
2025-10-21
Summary
This paper investigates how the way we train Multimodal Language Models (MLLMs), which combine vision and language, affects not just the model as a whole but specifically the part that 'sees': the vision encoder. It finds that a newer training method, Reinforcement Learning (RL), significantly improves the vision encoder's ability to understand images compared to the more traditional Supervised Finetuning (SFT).
What's the problem?
Most research assumes that MLLMs perform well simply because the language part of the model is so powerful. This has left a gap in our understanding of how the vision encoder works and how different training methods change it. As training paradigms evolve, especially with the shift toward Reinforcement Learning, it becomes even more important to understand how these changes affect the vision encoder's ability to process images. Essentially, we don't know *why* some MLLMs understand images better than others, or whether the vision component is even being fully utilized.
What's the solution?
The researchers compared MLLMs trained with Supervised Finetuning against those trained with Reinforcement Learning. They tested the vision encoder's abilities directly, using tasks like ImageNet classification and segmentation, and examined how the encoder 'sees' images by visualizing its gradients. They found that Reinforcement Learning produces a vision encoder that is better at recognizing fine details and pinpointing locations within an image. Based on this, they developed a new training method called Preference-Instructed Vision OpTimization (PIVOT), which optimizes the vision encoder directly using a preference-based approach.
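The summary does not spell out PIVOT's exact objective, but "preference-based optimization" typically means training the model to widen the gap between a preferred and a dispreferred output relative to a frozen reference model. As a minimal, hypothetical sketch (the function name, inputs, and the choice of the standard DPO-style loss are assumptions, not the paper's actual recipe):

```python
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style preference loss for a single preference pair.

    Encourages the trained model to assign a larger log-probability
    margin to the preferred output than a frozen reference model does.
    This is a generic illustration, not PIVOT's actual objective.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) rewritten as softplus(-beta * margin)
    return math.log1p(math.exp(-beta * margin))

# If the trained model exactly matches the reference, the margin is 0
# and the loss sits at log(2); improving on the reference lowers it.
print(preference_loss(-1.0, -2.0, -1.0, -2.0))  # ≈ 0.6931
```

The appeal of such an objective for a vision encoder is that it needs only pairwise preference judgments over outputs, not dense labels, which is consistent with the paper's claim that PIVOT is far cheaper than standard vision pretraining.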
Why it matters?
This work is important because it shows that improving the vision encoder is crucial for building better MLLMs. The PIVOT method is particularly exciting because it produces a strong vision encoder at less than 1% of the computational cost of standard vision pretraining. This means we can build more capable and efficient MLLMs, opening up possibilities for advances in areas like image understanding and visual question answering.
Abstract
A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight: namely, the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/