
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu, Gaurav Srivastava, Yuchen Zhuang, Mohamed Elhoseiny, Charles Fleming, Carl Yang, Zhengzhong Tu, Yang Xie, Guanghua Xiao, Hanrui Wang, Di Jin, Wenqi Shi, Xuan Wang

2025-11-26


Summary

This paper introduces a new way to train vision-language models not just to *see* images, but to actually *think* through problems using those images, especially when solving the problem requires calling on external tools.

What's the problem?

Current vision-language models are good at understanding what's in an image, but they struggle with tasks that require multiple steps and tool use – for example, identifying objects and then using that information to perform a follow-up action. They reason well with text alone, but fall apart when that reasoning has to be combined with interacting with the visual world and invoking tools to get things done.

What's the solution?

The researchers created a training environment called VISTA-Gym, which lets models practice a variety of real-world tasks that require visual reasoning and tool use. They then used this environment to train a model, VISTA-R1, to choose the right tools, invoke them correctly, and combine tool use with its own reasoning. The training uses reinforcement learning: the model tries multi-step interactions, receives verifiable feedback on whether its final answer is correct, and improves by trial and error.
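To make the trial-and-error loop concrete, here is a minimal sketch of a multi-turn, tool-integrated episode with a verifiable reward at the end. All names (`grounding_tool`, `run_episode`, the reward scheme) are illustrative assumptions in the spirit of the paper, not VISTA-Gym's actual API.

```python
# Hypothetical sketch of one agentic tool-use episode.
# The policy alternates between calling a visual tool and answering;
# the environment returns a verifiable (binary) reward at the end.

def grounding_tool(image, query):
    """Toy stand-in for a visual grounding tool: returns a bounding box."""
    return {"box": [10, 20, 50, 60], "label": query}

TOOLS = {"ground": grounding_tool}  # standardized tool interface (assumed)

def run_episode(policy, image, question, max_turns=4):
    """Roll out one trajectory: reasoning steps interleaved with tool calls."""
    history = [("question", question)]
    for _ in range(max_turns):
        action = policy(image, history)  # policy decides: tool call or answer
        if action["type"] == "tool":
            result = TOOLS[action["name"]](image, action["arg"])
            history.append(("tool_result", result))  # feed observation back
        else:
            history.append(("answer", action["text"]))
            break
    return history

def verifiable_reward(history, gold_answer):
    """Verifiable feedback signal: 1.0 if the final answer matches, else 0.0."""
    answers = [h for h in history if h[0] == "answer"]
    return 1.0 if answers and answers[-1][1] == gold_answer else 0.0

# Trivial scripted policy for illustration: ground the object, then answer.
def toy_policy(image, history):
    if not any(h[0] == "tool_result" for h in history):
        return {"type": "tool", "name": "ground", "arg": "cat"}
    return {"type": "answer", "text": "cat"}

traj = run_episode(toy_policy, image=None, question="What animal is shown?")
print(verifiable_reward(traj, "cat"))  # → 1.0
```

In actual training, the scripted `toy_policy` would be the VLM itself, many such trajectories would be sampled per question, and the rewards would drive an end-to-end reinforcement learning update.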

Why it matters?

This work matters because it shows a way to significantly improve the reasoning abilities of vision-language models. VISTA-R1 outperformed similarly sized models on reasoning-intensive visual question answering benchmarks by 9.51%–18.72%, demonstrating that this training approach can unlock tool-integrated reasoning in these systems and let them tackle more complex, real-world problems.

Abstract

While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.