
ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario

Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang

2025-01-20


Summary

This paper introduces a new benchmark called ComplexFuncBench, which is designed to test how well AI language models can handle complex tasks that involve calling multiple functions in sequence and following specific rules. It's like building a tough obstacle course for AI, to see how capable these models really are when faced with real-world problems.

What's the problem?

AI language models are getting really good at understanding and generating text, but it's hard to know how well they can handle more complicated tasks that involve using different tools or functions in the right order. The current ways of testing AI don't really capture how tricky these tasks can be in the real world, where you might need to use multiple tools, follow certain rules, and deal with a lot of information at once.

What's the solution?

The researchers created ComplexFuncBench, a demanding test for AI. It covers five different real-world scenarios that require the model to call multiple functions in sequence, follow specific constraints, and handle very long contexts (up to 128,000 tokens of information at once). They also built an automatic evaluation framework called ComplexEval that grades how well a model does on these tasks. To make the test fair and realistic, they carefully constructed 1,000 examples of complex function-calling tasks for models to try.
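To make this more concrete, here is a rough sketch of what a single multi-step, constrained task in such a benchmark might look like. The scenario, function names, and fields below are hypothetical illustrations chosen for this explainer, not the actual ComplexFuncBench schema.

```python
# Hypothetical sketch of a multi-step, constrained function-calling sample.
# Function names and fields are illustrative, not the real benchmark format.

sample = {
    "query": ("Find a hotel in Paris for 2 adults, May 3-5, under $200/night, "
              "then check whether the cheapest option offers an airport shuttle."),
    "constraints": {"max_price_per_night": 200, "adults": 2},
    "golden_steps": [
        {"name": "search_hotels",
         "arguments": {"city": "Paris", "check_in": "2025-05-03",
                       "check_out": "2025-05-05", "adults": 2,
                       "max_price": 200}},
        # The second call depends on the result of the first one, so the
        # model has to reason over intermediate API responses, not just
        # fill in parameters from the original question.
        {"name": "get_hotel_services",
         "arguments": {"hotel_id": "<cheapest_hotel_id_from_step_1>",
                       "service": "airport_shuttle"}},
    ],
}
```

The key difficulty the benchmark targets is visible even in this toy example: some argument values cannot be copied from the user's question and must instead be derived from earlier function results, while constraints like the price cap must be respected throughout.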

Why it matters?

This matters because as we start using AI more in our daily lives, we need to make sure it can handle real-world tasks that are often more complicated than just answering simple questions. By creating this tough test, researchers can find out where current AI models struggle and figure out how to make them better. This could lead to AI assistants that are much more helpful and reliable in complex situations, like planning a trip or managing a project, where you need to use multiple tools and follow specific guidelines. Ultimately, this research helps push AI technology forward, making it more useful and trustworthy for everyone.

Abstract

Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the complexity of data collection and evaluation. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling, which requires long-parameter filling, parameter value reasoning, and 128k long context. Additionally, we propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we demonstrate the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities. The data and code are available at https://github.com/THUDM/ComplexFuncBench.
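To illustrate what evaluating multi-step function calling under a growing context involves, here is a minimal sketch of the kind of loop such a benchmark implies: the model proposes a function call, a (simulated) API returns a result that is appended to the conversation, and the resulting call trace is compared against a golden trace. The `model.generate_call` and `api.execute` interfaces and the scoring rule are hypothetical stand-ins; this is not the actual ComplexEval implementation, which performs more fine-grained quantitative evaluation.

```python
# Minimal sketch of a multi-step function-calling evaluation loop.
# All interfaces here are hypothetical stand-ins, not the ComplexEval code.

def run_episode(model, api, sample, max_steps=10):
    """Let the model issue function calls one at a time, feeding each
    API response back into the context until it answers directly."""
    context = [{"role": "user", "content": sample["query"]}]
    trace = []
    for _ in range(max_steps):
        call = model.generate_call(context, tools=sample["tools"])
        if call is None:                      # model chose to answer directly
            break
        trace.append(call)
        result = api.execute(call)            # simulated real-time API
        context.append({"role": "tool", "name": call["name"],
                        "content": result})   # context grows at every step
    return trace

def score(trace, golden_trace):
    """Crude proxy score: fraction of golden calls reproduced in order,
    with exactly matching names and arguments."""
    hits = sum(1 for p, g in zip(trace, golden_trace)
               if p["name"] == g["name"] and p["arguments"] == g["arguments"])
    return hits / max(len(golden_trace), 1)
```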