ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia
2025-10-14
Summary
This paper introduces a new way to improve Large Vision-and-Language Models (LVLMs) after they've been initially trained, called ViSurf. These models are good at understanding both images and text, but often need further refinement to perform specific tasks better.
What's the problem?
Currently, there are two main methods for improving these models: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT is good for adding new facts, but doesn't always make the model *think* better. RLVR helps with reasoning, but struggles when the task requires knowledge the model doesn't already have. Basically, each method has its weaknesses when used alone.
What's the solution?
ViSurf combines the strengths of SFT and RLVR in a single training stage. It injects the correct (ground-truth) answer into the batch of candidate responses the model samples during reinforcement learning, so the model receives direct supervision (like SFT) while still learning through trial and error (like RLVR). The researchers also introduce three reward control strategies that keep this combined training stable and effective. In essence, the model gets both guidance and the chance to figure things out on its own.
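The core idea can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes a GRPO-style setup in which rewards are normalized within a group of rollouts, and it appends the ground-truth answer (with maximum verifiable reward) to that group before computing advantages.

```python
import numpy as np

def visurf_advantages(rollout_rewards, gt_reward=1.0):
    """Sketch of ViSurf's core mechanism (our simplification):
    append the ground-truth answer to the group of RLVR rollouts
    before computing group-normalized advantages. The ground truth
    gets the maximum verifiable reward, so its positive advantage
    acts as an SFT-like supervised signal, while the sampled
    rollouts still provide internal reinforcement."""
    rewards = np.array(list(rollout_rewards) + [gt_reward], dtype=float)
    # Group-normalized advantages, as in GRPO-style RLVR.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return adv  # last entry belongs to the injected ground truth

# Example: all sampled rollouts fail (reward 0) on a task beyond the
# model's knowledge; only the injected ground truth carries signal.
adv = visurf_advantages([0.0, 0.0, 0.0])
```

This illustrates why ViSurf helps exactly where RLVR alone stalls: when every sampled rollout earns zero reward, group normalization would otherwise yield no gradient, but the injected ground truth always contributes a positive learning signal.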
Why it matters?
ViSurf is important because it consistently outperforms both SFT and RLVR when used separately, and even beats using them one after the other. This means it's a more efficient and effective way to build powerful vision-and-language models that can both understand information and reason about it, leading to better performance on a variety of tasks.
Abstract
Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming individual SFT, individual RLVR, and the two-stage SFT → RLVR pipeline. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
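The unified objective mentioned in the abstract can be sketched roughly as follows, in our own notation under a simplifying GRPO-style assumption (the paper's exact derivation may differ). With $G$ sampled rollouts $o_1, \dots, o_G$ for a query $q$, and the ground-truth answer injected as $o_{G+1}$:

```latex
\mathcal{L}_{\text{ViSurf}}(\theta)
  = -\,\mathbb{E}_{q}\!\left[\sum_{i=1}^{G+1} \hat{A}_i \log \pi_\theta(o_i \mid q)\right],
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_{1:G+1})}{\operatorname{std}(r_{1:G+1})}
```

Because the injected ground truth $o_{G+1}$ receives the maximum verifiable reward, its term reduces to a positively weighted log-likelihood, i.e. an SFT-style cross-entropy signal, while the remaining terms provide RLVR-style internal reinforcement.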