
Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

2026-04-22


Summary

This paper investigates whether large language models, like those powering chatbots, can automatically create complex, working processes for businesses, instead of relying on people to build them by hand.

What's the problem?

Currently, building these automated business processes, called visual workflows, is a manual and difficult task. Developers have to design each step, write prompts for each part, and constantly fix errors as needs change. Because it is all done by hand, this process takes a lot of time and money and is prone to mistakes.

What's the solution?

The researchers created a new benchmark called Chat2Workflow, built from real-world business processes. They then developed a system that uses a large language model to automatically generate these workflows from a description in plain language, along with an agentic framework that helps the model fix errors that occur when a workflow is run. They found that while the language model can usually grasp the general idea, it often struggles to produce workflows that actually run correctly, especially as the requirements get complicated.
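The generate-then-repair idea described above can be sketched as a simple loop. This is a minimal, purely illustrative sketch: the function names (`generate_workflow`, `run_workflow`, `repair_workflow`) are hypothetical stand-ins for LLM calls and a workflow executor, not the authors' actual API.

```python
# Illustrative sketch of a generate-then-repair loop for workflow synthesis.
# All names here are hypothetical stand-ins, not the paper's implementation.

def generate_workflow(description):
    """Stand-in for an LLM call that turns a natural-language description
    into a workflow specification (here, a toy node/edge dict)."""
    return {"nodes": [{"id": "start"}, {"id": "llm_step"}, {"id": "end"}],
            "edges": [("start", "llm_step"), ("llm_step", "end")]}

def run_workflow(workflow):
    """Stand-in for executing the workflow on a platform.
    Returns an error message on failure, or None on success."""
    if not workflow.get("edges"):
        return "error: workflow has no edges"
    return None

def repair_workflow(workflow, error):
    """Stand-in for an LLM repair call that patches the workflow
    using the execution error as feedback."""
    workflow.setdefault("edges", []).append(("start", "end"))
    return workflow

def generate_with_repair(description, max_rounds=3):
    """Generate a workflow, then iteratively execute and repair it
    until it runs cleanly or the round budget is exhausted."""
    workflow = generate_workflow(description)
    for _ in range(max_rounds):
        error = run_workflow(workflow)
        if error is None:
            return workflow  # executable workflow found
        workflow = repair_workflow(workflow, error)
    return workflow  # best effort after max_rounds

wf = generate_with_repair("Summarize a document and email the result")
print(run_workflow(wf) is None)  # True on this toy input
```

The key design point is that execution errors are fed back into the repair step, which is what the paper's agentic framework uses to reduce recurrent failures.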

Why it matters?

This research is important because it highlights the potential, but also the current limitations, of using AI to automate the creation of essential business processes. The Chat2Workflow testing ground provides a way for researchers to continue improving these AI systems and move closer to fully automated, industrial-strength workflow creation.

Abstract

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
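The abstract notes that generated workflows are transformed for deployment on platforms such as Dify and Coze. To make the target concrete, here is a purely illustrative node-and-edge specification with a basic well-formedness check; this is not the benchmark's actual schema nor either platform's real format, just a sketch of the kind of structure such a workflow carries.

```python
import json

# Purely illustrative workflow specification (NOT the Chat2Workflow schema
# or the Dify/Coze format): nodes carry a type and parameters, edges wire
# node outputs to downstream inputs.
workflow = {
    "name": "ticket_triage",
    "nodes": [
        {"id": "start", "type": "input", "params": {"field": "ticket_text"}},
        {"id": "classify", "type": "llm",
         "params": {"prompt": "Classify this support ticket: {ticket_text}"}},
        {"id": "route", "type": "branch",
         "params": {"on": "classify.output",
                    "cases": ["billing", "bug", "other"]}},
        {"id": "end", "type": "output", "params": {}},
    ],
    "edges": [["start", "classify"], ["classify", "route"], ["route", "end"]],
}

def is_well_formed(wf):
    """Minimal static check a generated workflow must pass before execution:
    every edge endpoint must refer to a declared node id."""
    node_ids = {n["id"] for n in wf["nodes"]}
    return all(src in node_ids and dst in node_ids for src, dst in wf["edges"])

print(is_well_formed(workflow))           # True for this example
print(json.dumps(workflow)[:20])          # spec serializes for deployment
```

Checks like this only catch structural errors; the harder failures the paper measures arise at execution time, which is why the benchmark evaluates whether workflows actually run.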