Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou

2026-04-10

Summary

This paper focuses on making AI agents, which can see and interact with the world, smarter about *when* to use tools like search engines or calculators. It addresses a key weakness in these agents: they often rely on tools even when they already have the answer from what they can directly observe.

What's the problem?

Current AI agents are too quick to use tools. Imagine an agent looking at a picture of a red apple and calling a search tool to ask about its color instead of simply noticing that it is red. This 'blind tool invocation' slows responses down and introduces errors, because tools aren't always reliable. Previous attempts to fix this by simply penalizing tool use haven't worked well: either the penalty is so strong that the agent stops using tools altogether, even when they would help, or it is so mild that it gets drowned out by the accuracy reward and the agent keeps overusing tools.

What's the solution?

The researchers developed a new framework called HDPO. Instead of trying to balance accuracy and tool use with a single penalty, HDPO treats tool efficiency as a separate, conditional goal: the agent is rewarded for correctness first, and rewarded for using fewer tools only among the attempts that already reach a correct answer. This creates a natural learning curriculum in which the agent first masters solving problems *without* tools and then learns to reach for external help only when it is genuinely necessary.
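To make the "two separate goals" idea concrete, here is a minimal sketch of how such decoupled, conditional advantages could be computed for a group of sampled attempts. This is a hypothetical illustration, not the authors' implementation: the function name `hdpo_advantages`, the binary accuracy reward, and the group-normalization details are all assumptions for the sake of the example.

```python
import numpy as np

def hdpo_advantages(correct, tool_calls, eps=1e-8):
    """Illustrative decoupled advantages for one group of rollouts.

    correct:    1.0 if the trajectory reached the right answer, else 0.0
    tool_calls: number of tool invocations in each trajectory
    Returns (accuracy_advantage, efficiency_advantage) per trajectory.
    """
    r = np.asarray(correct, dtype=float)
    t = np.asarray(tool_calls, dtype=float)

    # Accuracy channel: a standard group-normalized advantage.
    adv_acc = (r - r.mean()) / (r.std() + eps)

    # Efficiency channel: computed ONLY over correct trajectories,
    # so incorrect attempts receive no efficiency signal at all.
    adv_eff = np.zeros_like(r)
    mask = r == 1.0
    if mask.sum() > 1:
        tc = t[mask]
        # Fewer tool calls -> higher advantage, normalized within
        # the correct subset (the "conditional" part).
        adv_eff[mask] = (tc.mean() - tc) / (tc.std() + eps)

    return adv_acc, adv_eff
```

Because the efficiency signal only exists inside correct trajectories, it can never push the agent away from solving the task; it only refines *how* the task is solved once the agent already knows how to get it right.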

Why it matters?

This work is important because it makes AI agents more efficient and reliable. By reducing unnecessary tool use, these agents can respond faster and make fewer mistakes. It is a big step towards AI systems that reason more like humans do: using knowledge and tools strategically, rather than relying on tools as a default.

Abstract

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.