GTA: A Benchmark for General Tool Agents

Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, Xinyi Le

2024-07-13

Summary

This paper presents the GTA benchmark, which is designed to evaluate how well large language models (LLMs) can use different tools to solve real-world problems. Rather than testing isolated skills, it assesses whether models can understand a user's goal, pick the right tools, and apply them in practical situations.

What's the problem?

Current evaluations of LLM tool use often do not reflect real-world scenarios. They typically rely on AI-generated queries, single-step tasks, dummy tools, and text-only interactions, which do not accurately test the models' problem-solving skills. As a result, we don't really know how well these models can handle complex tasks that require choosing and using multiple tools effectively.

What's the solution?

To address this issue, the researchers created the GTA benchmark, which includes three main features: (1) Real user queries that are written by humans and require the model to figure out which tools to use; (2) Actual tools that can perform tasks in areas like perception, operation, logic, and creativity; and (3) Real multimodal inputs, such as images and tables, that provide context for the tasks. The benchmark consists of 229 tasks that require the models to reason through their tool choices and execution steps.
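
To make the task format concrete, here is a minimal Python sketch of what a single GTA-style task could look like. The class and field names are illustrative assumptions for explanation, not the actual schema of the released dataset.

from dataclasses import dataclass, field


@dataclass
class ToolStep:
    """One step in an executable tool chain."""
    tool: str        # e.g. an OCR, calculator, or image-generation tool
    arguments: dict  # arguments the agent must fill in


@dataclass
class GTATask:
    """A single task with a human-written query and implicit tool use."""
    query: str                                                 # real-world objective, tools never named
    files: list[str] = field(default_factory=list)             # image/table inputs giving context
    tool_chain: list[ToolStep] = field(default_factory=list)   # reference solution steps
    answer: str = ""                                           # gold final answer for scoring


# A made-up example in the spirit of the benchmark: the query never names the
# tools, so the agent must infer an OCR step followed by a calculator step.
example = GTATask(
    query="How much would two of everything on this receipt cost?",
    files=["receipt.png"],
    tool_chain=[
        ToolStep(tool="OCR", arguments={"image": "receipt.png"}),
        ToolStep(tool="Calculator", arguments={"expression": "2 * 13.50"}),
    ],
    answer="27.00",
)

The key property this structure illustrates is that the query alone never mentions a tool: the agent has to infer the whole chain from the query and the attached files.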

Why it matters?

This research is important because it gives a clearer picture of how capable AI models actually are at using tools in real-life situations. By identifying the limitations of current LLMs in tool use, the GTA benchmark can guide the development of more versatile AI agents that can assist with a wide range of everyday tasks.

Abstract

Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason about the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.
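
For intuition about what implicit tool use demands of a model, below is a generic, minimal Python sketch of the kind of tool-calling loop such an evaluation exercises: the model sees the query and the available tools, then picks tools step by step until it commits to a final answer. The llm callable, the two stub tools, and the JSON reply format are illustrative assumptions, not the GTA evaluation platform's actual API.

import json
from typing import Callable

# Hypothetical tool registry; GTA's real tools span perception, operation,
# logic, and creativity. These two are stand-ins for a perception tool (OCR)
# and a logic tool (calculator).
TOOLS: dict[str, Callable[..., str]] = {
    "ocr": lambda image: "total: 13.50",                     # stubbed perception tool
    "calculator": lambda expression: str(eval(expression)),  # stubbed logic tool
}


def run_agent(llm: Callable[[str], str], query: str, max_steps: int = 5) -> str:
    """Let the model choose tools step by step until it emits a final answer."""
    transcript = f"Task: {query}\nAvailable tools: {sorted(TOOLS)}\n"
    for _ in range(max_steps):
        # Assume the model replies with JSON such as
        # {"tool": "calculator", "arguments": {"expression": "2 * 13.50"}}
        # or {"final_answer": "27.00"} once it is done.
        reply = json.loads(llm(transcript))
        if "final_answer" in reply:
            return reply["final_answer"]
        result = TOOLS[reply["tool"]](**reply["arguments"])
        transcript += f"Tool {reply['tool']} returned: {result}\n"
    return ""  # no answer within the step budget counts as a failed task

Comparing each returned answer against the task's gold answer over all 229 tasks then yields roughly the kind of per-model completion rates reported above.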