AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

2025-12-19

Summary

This paper introduces AdaTooler-V, a new multimodal large language model that's better at deciding when it actually *needs* to use visual tools like image recognition to solve problems.

What's the problem?

Current multimodal AI models often overuse visual tools, invoking things like image analysis even when the problem could be solved by reasoning alone. This wastes compute and can actually make the AI less accurate, because the unnecessary tool outputs distract it from the answer.

What's the solution?

The researchers created AdaTooler-V and a new training method called AT-GRPO. AT-GRPO uses a reward system that encourages the model to only use tools when they genuinely improve the answer. They also built two large datasets specifically designed to train the model to make these smart decisions about tool use, covering single images, multiple images, and even videos.
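The core idea can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names, the exact form of the Tool Benefit Score, and the linear scaling rule are all assumptions made for clarity.

```python
# Illustrative sketch (NOT the paper's code) of reward scaling keyed to a
# per-sample Tool Benefit Score, in the spirit of AT-GRPO. The exact score
# definition and scaling rule here are assumptions for illustration.

def tool_benefit_score(acc_with_tools: float, acc_without_tools: float) -> float:
    """How much tool use helps on this sample, in [-1, 1].

    Assumed here to be the gap in rollout accuracy with vs. without tools.
    """
    return acc_with_tools - acc_without_tools


def adaptive_reward(correct: bool, used_tools: bool, benefit: float,
                    tool_scale: float = 0.5) -> float:
    """Base reward for correctness, plus a tool-use term scaled by benefit.

    If tools genuinely help on this sample (benefit > 0), invoking them is
    rewarded; if they don't (benefit <= 0), invoking them is penalized,
    which pushes the policy toward tool-free reasoning on easy samples.
    """
    reward = 1.0 if correct else 0.0
    if used_tools:
        reward += tool_scale * benefit  # positive or negative adjustment
    return reward


# Example: a sample where tools raise rollout accuracy from 0.2 to 0.8.
b = tool_benefit_score(0.8, 0.2)                              # ~0.6: tools help
r = adaptive_reward(correct=True, used_tools=True, benefit=b)  # ~1.3
```

Under this kind of scheme, the model is not forbidden from calling tools; it simply stops being paid for tool calls that do not move the answer, which is the behavior the AT-GRPO reward is designed to encourage.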

Why it matters?

AdaTooler-V is a significant step forward because it performs better than many existing AI models, even some that are commercially available like GPT-4o and Gemini 1.5 Pro, on complex visual reasoning tasks. It's more efficient and accurate, and the researchers are making all their work – the code, the model, and the data – publicly available so others can build on it.

Abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.