Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin
2025-12-04
Summary
This paper introduces CodeVision, a new way for AI models that can 'see' and 'think' with images to solve complex problems. Instead of limiting the AI to a specific set of tools, CodeVision lets the AI write its own code to perform any image operation it needs, making it much more flexible and powerful.
What's the problem?
Current AI models that combine images and tools aren't very reliable. They struggle even with simple changes to images, like rotations or small amounts of noise, so they can easily fail on real-world images that aren't perfect. They also rely on a pre-defined set of tools, which limits what they can do and how well they adapt to new situations.
What's the solution?
The researchers developed CodeVision, which lets the AI generate code to interact with images. This 'code-as-tool' approach gives the AI access to a far wider range of image operations than any fixed toolset. They trained the AI in two stages: first, supervised fine-tuning on a curated set of examples of correct, multi-step tool use; then reinforcement learning with a reward system that encourages the AI to use tools strategically and efficiently, and to recover when its code fails. They also built new training datasets and a benchmark suite that specifically measures how well the AI handles rotated or corrupted images and tasks that require chaining several tools.
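The core loop can be sketched as follows (a minimal illustration, not CodeVision's actual interface; the function and variable names here are hypothetical): the model emits a Python snippet, an executor runs it against the current image, and either the result or the error traceback is fed back so the model can retry.

```python
import traceback

def run_tool_code(code: str, image):
    """Execute a model-generated snippet in its own namespace.

    The snippet is expected to assign its output to `result`. On failure,
    the traceback is returned as the observation, which is what lets the
    model attempt error recovery on the next turn.
    """
    namespace = {"image": image}
    try:
        exec(code, namespace)
        return {"ok": True, "observation": namespace.get("result")}
    except Exception:
        return {"ok": False, "observation": traceback.format_exc()}

# Toy 2x3 "image" as a nested list; a real system would pass pixel arrays.
img = [[1, 2, 3],
       [4, 5, 6]]

# A snippet the model might emit to rotate the image 90 degrees clockwise:
snippet = "result = [list(row) for row in zip(*image[::-1])]"
out = run_tool_code(snippet, img)
print(out["observation"])  # → [[4, 1], [5, 2], [6, 3]]
```

Because the tool interface is just "code in, observation out", any operation expressible in code is available to the model, which is what distinguishes this design from a fixed tool registry.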
Why it matters?
CodeVision is important because it makes AI models that work with images much more robust and capable. By allowing the AI to write its own code, it can overcome the limitations of fixed toolsets and handle a wider variety of real-world scenarios. This could lead to significant improvements in areas like image analysis, robotics, and other applications where AI needs to understand and interact with the visual world.
Abstract
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.