TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Wei Chen, Konstantinos Psounis, Kaipeng Zhang
2025-11-04
Summary
This paper introduces a new way to test how well AI models can 'think with images': not just *looking* at pictures, but actively changing them with tools to solve problems.
What's the problem?
Current tests for AI image understanding are too simple. They mostly check whether an AI can find things in a picture or crop it, but not whether it can actually *use* tools to manipulate images in a smart, step-by-step way to work out an answer. In short, existing benchmarks don't challenge AI to genuinely reason about images and how to transform them to solve problems.
What's the solution?
The researchers created a new, much harder test called TIR-Bench. It contains 13 different challenges in which the AI must use various tools to process and transform images, such as editing, cropping, or rotating them, in order to solve a problem. They evaluated 22 different AI models on this new benchmark, and also studied whether training an AI specifically to use tools improved its performance.
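To make the idea of tool-based image reasoning concrete, here is a minimal sketch of an agentic tool loop. This is a hypothetical illustration, not the actual TIR-Bench harness or tool set: the "image" is a plain grid of numbers, and the agent's tool calls (crop, rotate) transform it step by step before an answer is read off.

```python
# Hypothetical sketch of a "thinking-with-images" tool loop (not the real
# TIR-Bench code): the model emits tool calls that transform the image,
# then answers from the transformed view.

def crop(img, top, left, h, w):
    """Return the h x w sub-grid starting at (top, left)."""
    return [row[left:left + w] for row in img[top:top + h]]

def rotate90(img):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

TOOLS = {"crop": crop, "rotate90": rotate90}

def run_agent(img, tool_calls):
    """Apply a sequence of (tool_name, kwargs) calls, as an agent would."""
    for name, kwargs in tool_calls:
        img = TOOLS[name](img, **kwargs)
    return img

# Toy 4x4 "image"; the agent crops the bottom-right 2x2 patch, then rotates it.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
calls = [("crop", {"top": 2, "left": 2, "h": 2, "w": 2}),
         ("rotate90", {})]
print(run_agent(image, calls))  # → [[15, 11], [16, 12]]
```

In a real thinking-with-images system the tool calls would be chosen by the model at each step of its chain-of-thought, and the tools would operate on actual pixel data rather than a toy grid.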
Why it matters?
This work is important because it pushes the field of AI towards more intelligent image understanding. If AI can truly 'think with images' and use tools effectively, it can solve more complex real-world problems, and this new benchmark provides a better way to measure and improve those abilities.
Abstract
The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-source and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and that strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.