MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng

2024-10-16

Summary

This paper introduces MTU-Bench, a new benchmark designed to evaluate how well large language models (LLMs) can use tools across scenarios of varying granularity, along with an instruction dataset that improves their ability to perform tasks requiring tool usage.

What's the problem?

Current benchmarks for evaluating tool usage in LLMs have limitations: they cover only a narrow set of tool-use scenarios, and they are costly to run because they rely on GPT API calls or human judges for scoring. This makes it hard to accurately assess how effectively models can use tools in real-world situations.

What's the solution?

MTU-Bench addresses these issues with a comprehensive evaluation framework covering five tool-usage scenarios, ranging from a single tool in a single turn to multiple tools across multiple turns of dialogue, plus out-of-distribution tasks. Its evaluation metrics are computed by directly comparing the model's predicted tool calls against ground-truth answers (as sketched below), without relying on expensive API-based or human evaluation. The benchmark is built by transforming existing high-quality datasets to simulate real-world tool usage, and the authors also release an instruction dataset called MTU-Instruct to improve the tool-use capabilities of existing LLMs.
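To make the "direct comparison against ground truth" idea concrete, here is a minimal sketch of how GPT-free tool-call scoring could look. The `ToolCall` layout and the two metrics below are illustrative assumptions for this explainer, not the benchmark's actual code or exact metric definitions.

```python
# Sketch of ground-truth-based tool-call scoring, in the spirit of
# MTU-Bench's GPT-free evaluation. Data layout and metric names are
# illustrative assumptions, not the benchmark's implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str     # which tool the model chose
    args: tuple   # sorted (key, value) pairs of the tool arguments


def tool_selection_accuracy(preds, golds):
    """Fraction of turns where the predicted tool name matches the gold tool."""
    hits = sum(p.name == g.name for p, g in zip(preds, golds))
    return hits / len(golds) if golds else 0.0


def exact_match_accuracy(preds, golds):
    """Fraction of turns where both the tool name and all arguments match."""
    hits = sum(p == g for p, g in zip(preds, golds))
    return hits / len(golds) if golds else 0.0


if __name__ == "__main__":
    gold = [ToolCall("get_weather", (("city", "Paris"),))]
    pred = [ToolCall("get_weather", (("city", "London"),))]
    print(tool_selection_accuracy(pred, gold))  # 1.0: right tool chosen
    print(exact_match_accuracy(pred, gold))     # 0.0: wrong arguments
```

Because scoring reduces to string and structure comparison against stored ground truth, the evaluation is deterministic and costs nothing to rerun, which is the point of avoiding GPT-based or human judging.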

Why it matters?

This research is important because it improves how we evaluate AI models that need to interact with tools, making them more effective in practical applications. By providing a broader, cheaper testing environment together with the MTU-Instruct training data, MTU-Bench can lead to better training of LLMs, ultimately improving their performance in tasks like data analysis, automation, and problem-solving.

Abstract

Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1). Insufficient evaluation scenarios (e.g., only cover limited tool-use scenes). (2). Extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool usage scenes (i.e., single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground truth without using any GPT or human evaluation metrics. Moreover, our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios, and we also propose an instruction dataset called MTU-Instruct data to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.