PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei

2026-02-25

Summary

This paper introduces a new way to train AI models that can use tools and think through problems over multiple steps, specifically focusing on models that can 'see' and understand images and videos.

What's the problem?

When these models are trained with reinforcement learning, they often learn to stop calling tools and to cut their reasoning short, which defeats the purpose of making them 'agentic', meaning capable of independent, multi-step action and reasoning. The paper calls this failure mode 'interaction collapse', and it limits how helpful these models can be.

What's the solution?

The researchers created a training framework called PyVision-RL that fixes this problem. During training, it oversamples candidate solution attempts, then filters and ranks them to keep the useful ones, and it adds an accumulative tool reward that credits the model for actually *using* tools across multiple steps of a solution. For videos, the system also picks out only the task-relevant frames instead of processing the whole video, which makes it much faster and more efficient.
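The paper itself does not publish pseudocode here, but the two ingredients described above can be sketched roughly. The following is a hypothetical Python illustration, not the authors' implementation: `select_rollouts` stands in for the oversampling-filtering-ranking idea (drop rollouts that never used a tool, then keep the best-scoring ones), and `accumulative_tool_reward` stands in for a per-tool-call bonus that is capped so the model cannot farm reward by spamming tools. All function names, fields, and constants are assumptions for illustration.

```python
def select_rollouts(rollouts, k):
    """Sketch of an oversample-filter-rank step (hypothetical).

    `rollouts` is a list of dicts like {"tool_calls": int, "reward": float}.
    Filter out rollouts that never interacted with a tool, then rank the
    remainder by reward and keep the top k for the policy update.
    """
    filtered = [r for r in rollouts if r["tool_calls"] > 0]
    ranked = sorted(filtered, key=lambda r: r["reward"], reverse=True)
    return ranked[:k]


def accumulative_tool_reward(task_reward, tool_calls, bonus=0.1, cap=0.5):
    """Sketch of an accumulative tool reward (hypothetical constants).

    Add a small bonus per tool call on top of the task reward, capped so
    that extra tool calls beyond the cap earn nothing.
    """
    return task_reward + min(bonus * tool_calls, cap)
```

For example, a rollout that solved the task (`task_reward = 1.0`) with three tool calls would score `1.3` under these assumed constants, while thirty tool calls would still top out at `1.5`. The cap is the key design choice: it sustains multi-turn tool use without rewarding pointless interaction.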

Why it matters?

This work is important because it shows how to build more capable and efficient AI agents that can truly understand and interact with the visual world. By preventing interaction collapse and focusing on relevant information, these models can be scaled up to handle more complex tasks and become genuinely useful tools.

Abstract

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
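The abstract's "on-demand context construction" means the model requests specific frames while it reasons, rather than receiving every frame up front. A minimal sketch of that idea, assuming the model emits frame indices as tool calls (the function name, the request format, and the frame budget are all illustrative assumptions, not the paper's API):

```python
def sample_frames_on_demand(num_frames, requested, max_frames=8):
    """Sketch of on-demand frame selection (hypothetical).

    `requested` is the sequence of frame indices the model asked for
    during reasoning. Return only valid, de-duplicated indices, capped
    at a frame budget, so visual token usage stays small.
    """
    selected = []
    for idx in requested:
        if 0 <= idx < num_frames and idx not in selected:
            selected.append(idx)
        if len(selected) == max_frames:
            break
    return selected
```

Only the selected frames would then be encoded into visual tokens, which is how the paper's approach cuts token usage relative to encoding a fixed dense sampling of the whole video.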