Benchmarking Agentic Workflow Generation

Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

2024-10-13

Summary

This paper introduces WorFBench, a new benchmark designed to evaluate how well large language models (LLMs) can generate workflows for complex tasks, along with WorFEval, a protocol for scoring the workflows they produce.

What's the problem?

While large language models handle a wide range of tasks well, existing methods for evaluating their ability to create workflows either focus only on overall end-to-end performance or suffer from limitations such as narrow scenario coverage, overly simple workflow structures, and loose evaluation standards. This makes it hard to understand how well these models can really plan and execute complex, multi-step tasks.

What's the solution?

To address this, the authors developed WorFBench, a benchmark that assesses workflow generation across multi-faceted scenarios using intricate graph-structured workflows rather than only simple linear ones. They also created WorFEval, an evaluation protocol that scores generated workflows with subsequence and subgraph matching algorithms, checking both the ordering of steps and the structure of the workflow graph. Through extensive testing of different types of LLMs, they found a clear gap between how well models plan linear sequences and how well they plan more complex graph-structured workflows, with even advanced models like GPT-4 showing a gap of around 15%. They also trained two open-source models and evaluated how well those models generalize to held-out tasks.
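To make the two evaluation views concrete, here is a minimal Python sketch of how a subsequence-style score and a subgraph-style score can diverge on the same prediction. This is not the paper's actual WorFEval implementation: the workflow representation (node-label lists plus directed edge pairs), the scoring functions, and the edge-overlap heuristic are simplified, hypothetical illustrations.

def lcs_length(pred, gold):
    """Length of the longest common subsequence of two node-label lists."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def sequence_score(pred_nodes, gold_nodes):
    """Subsequence-style score: ordered nodes shared with the gold workflow."""
    return lcs_length(pred_nodes, gold_nodes) / len(gold_nodes) if gold_nodes else 0.0

def graph_score(pred_edges, gold_edges):
    """Crude subgraph-style score: fraction of gold edges the prediction recovers."""
    return len(set(pred_edges) & set(gold_edges)) / len(gold_edges) if gold_edges else 0.0

# Hypothetical gold workflow: search -> filter, then "summarize" and "cite" in parallel.
gold_nodes = ["search", "filter", "summarize", "cite"]
gold_edges = [("search", "filter"), ("filter", "summarize"), ("filter", "cite")]

# A prediction that skips the "filter" node still orders the remaining steps correctly.
pred_nodes = ["search", "summarize", "cite"]
pred_edges = [("search", "summarize"), ("search", "cite")]

print(sequence_score(pred_nodes, gold_nodes))  # 0.75: the linear ordering looks decent
print(graph_score(pred_edges, gold_edges))     # 0.0: the graph structure is not recovered

The toy output mirrors the paper's observation: a model can look competent when judged only on step ordering while still getting the graph structure of the workflow badly wrong.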

Why it matters?

This research is important because it offers a better framework for understanding the capabilities of LLMs in generating workflows, which are essential for many real-world applications like project management and automated decision-making. By identifying strengths and weaknesses in these models, the findings can help improve future AI systems and make them more effective in handling complex tasks.

Abstract

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset will be available at https://github.com/zjunlp/WorFBench.