Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

Qisheng Su, Shiting Huang, Zhen Fang, Ziyan Chen, Zehui Chen, Feng Zhao

2026-04-08

Summary

This paper investigates how to better measure the efficiency of large language models (LLMs) when they're used with tools, like calculators or search engines. It points out that simply counting tokens or tool uses doesn't accurately reflect how long these processes actually take.

What's the problem?

When LLMs use tools, there's a back-and-forth process. The LLM asks a question, the tool responds, and then the LLM uses that response. This creates delays because the LLM has to wait for the tool, and the tool's long answers can slow things down by filling up the model's memory. Existing ways to measure efficiency, like counting tokens, don't account for these delays and memory issues, so they don't give a true picture of performance.

What's the solution?

The researchers introduce a new metric called PTE (Prefill Token Equivalents). PTE aims to capture the *actual* cost of tool use by accounting for both the LLM's internal reasoning and the time spent waiting for and processing tool responses. In particular, it accounts for how much memory (KV cache) the tool responses consume and how often cached context must be recomputed after being evicted during tool calls. Tested in a real-world, high-concurrency setting, PTE proved a much better predictor of wall-clock latency than simple token counts.
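The paper's exact formula is not reproduced here, but the idea can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function name `pte_cost`, the per-step tuple layout, and the `decode_weight` factor (folding the higher per-token cost of memory-bound decoding into prefill units) are all assumptions made for this sketch.

```python
# Hypothetical sketch of a PTE-style cost. Assumptions (not from the paper):
# the function name, the step representation, and decode_weight.
# Idea: decode tokens and recomputed-prefill tokens are converted into
# "prefill token equivalents" so internal reasoning and tool-use costs
# share one unit.

def pte_cost(steps, decode_weight=8.0):
    """steps: list of (prefill_tokens, decode_tokens, cache_evicted) per
    LLM request in a tool-integrated reasoning trajectory.

    - prefill_tokens: new tokens prefilled this step (prompt + tool response)
    - decode_tokens: tokens generated this step
    - cache_evicted: True if the KV cache was evicted while waiting on the
      tool, forcing the full prior context to be re-prefilled
    """
    total = 0.0
    context = 0  # tokens currently resident in the KV cache
    for prefill_tokens, decode_tokens, cache_evicted in steps:
        if cache_evicted:
            total += context  # re-prefill the whole evicted context
        context += prefill_tokens + decode_tokens
        # decoding is memory-bound, so each decoded token costs more than
        # a prefilled one; decode_weight expresses that in prefill units
        total += prefill_tokens + decode_weight * decode_tokens
    return total

# Two-step trajectory: the second step brings a long tool response and
# arrives after the KV cache was evicted during the tool call.
trajectory = [(500, 100, False), (2000, 150, True)]
print(pte_cost(trajectory))
```

Note how the second step's cost includes both the long tool response (2000 prefill tokens) and the 600-token re-prefill of the evicted context, costs that a plain token count would miss or undercount.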

Why it matters?

This work is important because as LLMs become more powerful and are used with more tools, understanding and improving their efficiency becomes crucial. PTE provides a more accurate way to measure this efficiency, which can help developers build faster and more reliable AI systems. The research also showed that simply using *more* tools doesn't necessarily lead to better answers, highlighting the need to use tools strategically.

Abstract

In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that tool calls create pauses between LLM requests and cause KV-cache eviction, forcing recomputation. In addition, the long, unfiltered responses returned by external tools inflate the KV cache, so each decode step spends more time loading the growing cache and becomes steadily slower as context length increases. However, existing efficiency metrics like token counts and tool-call counts fail to capture the real model inference latency. To address this, we introduce PTE (Prefill Token Equivalents), a hardware-aware TIR-efficiency metric that unifies internal reasoning and external tool-use costs while explicitly accounting for non-reusable KV cache and long-tool-response scenarios. Validation in a high-concurrency industrial setting indicates that PTE aligns significantly better with wall-clock latency than standard token counts, while maintaining consistent efficiency rankings across diverse hardware profiles. We conduct extensive experiments across five TIR benchmarks, quantify their PTE costs, and identify four inefficiency patterns that appear in TIR. We also find that trajectories with higher PTE costs tend to have lower reasoning correctness, indicating that simply using more tools does not improve answer quality.