Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
2025-03-04
Summary
This paper introduces Visual-RFT, a new method for improving AI models that work with images. It uses reinforcement learning to help these models understand and describe images more accurately, even when little training data is available.
What's the problem?
Current AI models that work with both text and images (called Large Vision-Language Models or LVLMs) often need a lot of labeled data to improve their performance on specific tasks. This can be expensive and time-consuming. Also, these models sometimes struggle to explain their reasoning or adapt to new types of images they haven't seen before.
What's the solution?
The researchers created Visual-RFT, which uses reinforcement learning to fine-tune LVLMs. The model generates several candidate responses for each image, each containing both reasoning and a final answer, and then receives feedback on how well each one did. The feedback comes from specially designed 'reward functions' that measure things like how accurately the model identifies objects in images. By learning from this feedback, the model improves its performance and reasoning abilities, even with limited training data.
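To make the reward idea concrete, here is a minimal sketch of an IoU-based verifiable reward for object detection, in the spirit of the paper's IoU reward. The box format `(x1, y1, x2, y2)` and the function names `iou` and `iou_reward` are illustrative assumptions, not taken from the paper's released code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_reward(predicted_boxes, ground_truth_boxes):
    """Average best-match IoU over ground-truth boxes; higher is better.

    One hypothetical way to score a model's predicted boxes against the
    annotations: each ground-truth box is matched to its best-overlapping
    prediction, and the mean overlap becomes the scalar reward.
    """
    if not ground_truth_boxes:
        return 0.0
    best = [max((iou(gt, p) for p in predicted_boxes), default=0.0)
            for gt in ground_truth_boxes]
    return sum(best) / len(best)
```

Because the reward is computed directly from the predicted coordinates, it is "verifiable": no learned reward model is needed, only the annotations.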
Why it matters?
This matters because it could make AI systems that work with images much more efficient and adaptable. Visual-RFT showed large improvements in tasks like classifying specific types of images and detecting objects, even when it had only a few examples to learn from. This could lead to AI assistants that understand and describe images more accurately, which could be useful in fields like medicine, security, or assisting people with visual impairments.
Abstract
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT to visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by 21.9 on COCO's two-shot setting and 15.4 on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
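The abstract mentions GRPO as the policy optimization algorithm. A core idea of GRPO is that it avoids a learned value model by comparing each sampled response to the other responses drawn for the same input: rewards are standardized within the group to produce advantages. The sketch below illustrates only that group-relative step under that assumption; the full GRPO objective (clipped policy ratios, KL regularization) is omitted.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group of responses for the same input.

    Each sampled response's reward is compared against the group mean and
    scaled by the group standard deviation, so responses that beat their
    siblings get positive advantages and weaker ones get negative advantages.
    `eps` guards against division by zero when all rewards are identical.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In a Visual-RFT-style loop, the rewards here would come from the task-specific verifiable reward (for example, an IoU-based reward for detection), and the resulting advantages would weight the policy-gradient update.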