
Reinforced Visual Perception with Tools

Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, Ranjay Krishna

2025-09-09


Summary

This paper focuses on improving how well computers can 'think' with images, a skill humans do naturally. It's about making computer programs better at solving problems that require both seeing and reasoning, like understanding what's happening in a picture and then using that information to answer a question.

What's the problem?

Current computer systems are good at *seeing* things in images, but struggle to actually *reason* about them. Existing methods to improve this involve teaching these systems with lots of labeled examples, which is expensive and doesn't always work well when faced with new, slightly different situations. Simply showing the computer examples isn't enough to teach it how to strategically use visual 'tools' to solve complex problems.

What's the solution?

The researchers developed a new method called ReVPT (Reinforced Visual Perception with Tools) that uses reinforcement learning. Think of it like teaching a computer through trial and error, rewarding it when it makes good decisions. ReVPT trains the computer to use a suite of four visual 'tools' – imagine things like a calculator or a shape recognizer – to help it solve problems. It builds on a reinforcement learning algorithm called GRPO, adapted so the computer learns how to best combine these tools to get the right answer.
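To make the "trial and error with rewards" idea concrete, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO, the algorithm family ReVPT builds on. This is an illustrative simplification, not the paper's actual implementation: for one question, several answer attempts are sampled, each gets a reward, and each attempt's learning signal is how much better or worse it did than the group average.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and standard deviation of its own group, so
    above-average attempts get positive signal and below-average
    attempts get negative signal."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # If every attempt scored the same, there is nothing to learn from
    # this group, so return zero advantage for all of them.
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four sampled tool-use attempts on the same image question,
# rewarded 1.0 when the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct attempts receive positive advantage, incorrect ones negative,
# steering the model toward tool-use strategies that lead to right answers.
```

In practice these advantages weight a policy-gradient update on the model that decides which visual tool to call next; the sketch above only shows where the reward signal comes from.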

Why it matters?

This work is important because it shows a way to make computers much better at visual reasoning *without* needing tons of labeled data. ReVPT achieves better results than previous methods on several challenging tests, meaning it's a significant step towards creating AI systems that can truly understand and interact with the visual world like humans do. It also provides valuable insights into how to best train AI to use visual tools effectively.

Abstract

Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool-usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.