AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Mingyang Song, Haoyu Sun, Jiawei Gu, Linjie Li, Luxin Xu, Ranjay Krishna, Yu Cheng

2026-01-28

Summary

This paper introduces AdaReasoner, a new approach that helps AI models that can both 'see' images and 'understand' language (multimodal large language models) solve complex problems by learning to use tools.

What's the problem?

Current AI models struggle with tasks that require more than just basic understanding. When faced with difficult problems, humans use tools to extend their abilities, but getting AI to effectively choose the right tools, use them in the correct order, and even learn to use *new* tools is a major challenge. Existing methods often require specific training for each tool or task, making them inflexible and limited.

What's the solution?

The researchers developed AdaReasoner, which teaches AI models to use tools as a general skill rather than a task-specific one. They did this in three main ways. First, they created a large dataset showing models how to interact with tools over many steps. Second, they used a reinforcement learning technique called Tool-GRPO that rewards the model for successfully completing tasks with tools. Finally, they built in a mechanism that lets the model adjust how often it uses different tools based on how helpful they turn out to be. Together, these let the AI figure out which tools are useful in a given situation and combine them effectively, even tools it has never seen before.
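The name Tool-GRPO suggests a group-relative policy-optimization setup, where the rewards of several sampled rollouts for the same task are normalized against each other, so rollouts whose tool calls led to a correct answer are reinforced relative to their peers. The sketch below illustrates that general GRPO-style advantage computation only; it is not the authors' implementation, and the rollout structure, reward scheme, and function names are assumptions.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean and standard deviation of its sampled group. Rollouts that
    solved the task (e.g. via useful tool calls) get positive advantages;
    the rest get negative ones."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All rollouts scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Hypothetical group of 4 rollouts for one task:
# reward = 1.0 if the final answer was correct, else 0.0.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed within each group, no separate learned value function is needed, which is one reason GRPO-style methods are popular for reasoning-style RL.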

Why it matters?

This work is important because it moves AI closer to being able to tackle complex, real-world problems. AdaReasoner significantly improves the performance of AI models on challenging tasks, even outperforming some of the most advanced, commercially available systems. By learning to use tools adaptively, AI can become much more versatile and capable, reducing the need for constant retraining and making it more useful in a wider range of applications.

Abstract

When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.