CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao
2025-06-18
Summary
This paper introduces CRITICTOOL, a benchmark designed to test how well large language models can recognize and fix their own mistakes when calling external tools during complex tasks.
What's the problem?
The problem is that when AI models use tools to complete tasks, they often make errors that can be hard to detect and fix, and existing evaluations do not thoroughly measure how well models handle such mistakes.
What's the solution?
The researchers created CRITICTOOL, which covers many different types of real-world errors that occur during tool usage. It examines models' abilities to reflect on their actions, correct mistakes, retry failed calls, or skip a step when needed. The researchers then evaluated a range of language models to analyze how effectively they self-critique and recover from errors.
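To make the evaluation idea concrete, the sketch below shows a minimal, hypothetical harness for this kind of test: each case injects a known tool-call error, and the model is scored on whether it chooses the expected recovery behavior (reflect, correct, retry, or skip). The names (ErrorCase, evaluate) and the scoring rule are illustrative assumptions, not CRITICTOOL's actual taxonomy or metrics.

```python
# Hypothetical sketch of a self-critique evaluation loop for tool-calling errors.
# The error categories, behaviors, and scoring are illustrative, not the paper's
# actual implementation.
from dataclasses import dataclass
from typing import Callable

# Recovery behaviors a model might exhibit after a tool call fails.
BEHAVIORS = ("reflect", "correct", "retry", "skip")

@dataclass
class ErrorCase:
    """One benchmark item: a tool call that fails in a known way."""
    tool_name: str
    bad_arguments: dict
    error_message: str       # e.g. "missing required parameter 'city'"
    expected_behavior: str   # which of BEHAVIORS a good model should choose

def evaluate(model_respond: Callable[[str], str], cases: list[ErrorCase]) -> float:
    """Return the fraction of cases where the model picks the expected recovery."""
    correct = 0
    for case in cases:
        prompt = (
            f"You called tool '{case.tool_name}' with {case.bad_arguments} "
            f"and received the error: {case.error_message}\n"
            f"Choose one action: {', '.join(BEHAVIORS)}."
        )
        answer = model_respond(prompt).strip().lower()
        if case.expected_behavior in answer:
            correct += 1
    return correct / len(cases) if cases else 0.0

if __name__ == "__main__":
    # Toy stand-in for a language model: always proposes correcting the arguments.
    dummy_model = lambda prompt: "correct"
    demo_cases = [
        ErrorCase("weather_api", {"city": None},
                  "missing required parameter 'city'", "correct"),
        ErrorCase("search", {"query": "news"},
                  "503 service unavailable", "retry"),
    ]
    print(f"recovery accuracy: {evaluate(dummy_model, demo_cases):.2f}")
```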
Why it matters?
This matters because improving models' ability to spot and fix their own errors makes AI systems more reliable and robust, especially on complex tasks that depend on using multiple tools correctly.
Abstract
A comprehensive benchmark, CRITICTOOL, evaluates and enhances the robustness of large language models in handling errors during tool usage.