PyVision: Agentic Vision with Dynamic Tooling

Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei

2025-07-11


Summary

This paper introduces PyVision, a system that lets multimodal AI models write, run, and refine their own Python code to solve problems involving images, which improves their performance on visual reasoning tasks.

What's the problem?

Most AI systems that work with images rely on fixed tools and predefined step-by-step plans, so they struggle with new or tricky tasks that call for creative thinking or a change of strategy.

What's the solution?

PyVision lets these models write their own Python programs during a task, then test and refine them step by step. As a result, they can build tools suited to the specific problem at hand instead of relying on preset methods.
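The write, run, and refine cycle described here can be illustrated with a small sketch. This is not PyVision's actual implementation: the names `run_snippet`, `agentic_loop`, and `scripted_model` are hypothetical, the model is replaced by a scripted stand-in, and stopping at the first successful run is a simplifying assumption. The sketch only shows the general idea of executing generated code, feeding the output or error back, and trying again.

```python
import contextlib
import io
import traceback
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (code that was run, output or traceback)


def run_snippet(code: str, namespace: Dict[str, object]) -> Tuple[bool, str]:
    """Run a code string in a persistent namespace and capture what it prints."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # a real system would sandbox this step
        return True, buffer.getvalue()
    except Exception:
        return False, buffer.getvalue() + traceback.format_exc()


def agentic_loop(
    propose_code: Callable[[str, History], str],
    task: str,
    max_turns: int = 4,
) -> History:
    """Ask for code, execute it, and feed the result back until a run succeeds."""
    namespace: Dict[str, object] = {}  # variables persist across turns
    history: History = []
    for _ in range(max_turns):
        code = propose_code(task, history)
        ok, feedback = run_snippet(code, namespace)
        history.append((code, feedback))
        if ok:  # assumption: the first error-free run ends the loop
            break
    return history


def scripted_model(task: str, history: History) -> str:
    """Hypothetical stand-in for an MLLM: a buggy first attempt, then a fix."""
    if not history:
        return "print(sum(pixels) / len(pixels))"  # fails: pixels is undefined
    return "pixels = [10, 20, 30]\nprint(sum(pixels) / len(pixels))"


history = agentic_loop(scripted_model, "mean brightness of a toy image")
for turn, (code, feedback) in enumerate(history, start=1):
    print(f"--- turn {turn} ---")
    print(feedback.strip().splitlines()[-1])
```

In this toy run the first attempt raises a `NameError`, the error text is passed back through `history`, and the second attempt defines the missing data and succeeds. In a real system the proposing function would be a multimodal model that sees the image and the earlier feedback.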

Why does it matter?

This matters because models that can build their own tools adapt more readily to complex visual problems. The approach leads to better performance on benchmarks and opens up new possibilities for AI to reason through tasks in a more human-like way.

Abstract

PyVision enables multimodal large language models (MLLMs) to autonomously generate, execute, and refine Python-based tools for visual reasoning, achieving significant performance improvements across benchmarks.
