StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu
2025-02-24
Summary
This paper introduces StructFlowBench, a new benchmark for testing how well AI language models can understand and follow instructions in conversations with multiple back-and-forth exchanges.
What's the problem?
Current tests for AI language models mostly focus on how well they follow single instructions or satisfy specific constraints, but they don't check whether the AI understands how different parts of a conversation connect to each other. This matters because real conversations often build on things that were said earlier.
What's the solution?
The researchers created StructFlowBench, which defines six fundamental ways that turns in a conversation can relate to each other. The benchmark checks not only whether the AI follows each instruction but also whether it understands how the different parts of the conversation flow together. They used it to evaluate 13 different AI models, both open-source and closed-source.
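To make the idea of a "structural flow" concrete, here is a minimal sketch of how a multi-turn dialogue might be annotated with inter-turn relations and checked for well-formedness. The relation names, the `Turn` structure, and `validate_flow` are illustrative assumptions for this summary, not the paper's actual taxonomy or code.

```python
# Hypothetical sketch of a structural dialogue flow: each turn is
# annotated with a relation to an earlier turn it depends on.
# Relation names here are placeholders, not the paper's taxonomy.

from dataclasses import dataclass
from typing import Optional

# Placeholder set of six inter-turn relations (illustrative only).
RELATIONS = {"follow_up", "refinement", "recall",
             "expansion", "summary", "unrelated"}

@dataclass
class Turn:
    index: int                 # position of this turn in the dialogue
    instruction: str           # the user's instruction at this turn
    relation: str              # relation to the turn it depends on
    depends_on: Optional[int]  # index of the earlier turn, if any

def validate_flow(turns: list) -> bool:
    """Check that a dialogue flow is structurally well-formed:
    every relation is known and every dependency points backward."""
    for t in turns:
        if t.relation not in RELATIONS:
            return False
        if t.depends_on is not None and not (0 <= t.depends_on < t.index):
            return False
    return True

flow = [
    Turn(0, "Draft a short bio for a data scientist.", "unrelated", None),
    Turn(1, "Make it more formal.", "refinement", 0),
    Turn(2, "Now summarize the bio in one sentence.", "summary", 1),
]

print(validate_flow(flow))  # True: relations known, dependencies backward
```

A structure like this could also act as a generation template: fixing the sequence of relations first, then filling in instructions that realize each relation, which mirrors the paper's idea of using structural flow as generation parameters for customized dialogues.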
Why does it matter?
This matters because it helps us see how well AI can handle real-world conversations, which often involve building on previous statements or referring back to earlier topics. The test showed that current AI models still have trouble fully understanding these complex conversation structures. By identifying these weaknesses, researchers can work on making AI better at having more natural, flowing conversations with humans.
Abstract
Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions. This structural dependency not only reflects user intent but also establishes a second dimension for instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark innovatively defines a structural flow framework comprising six fundamental inter-turn relationships, which not only introduces novel structural constraints for model evaluation but also serves as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.