The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, Maarten de Rijke
2024-07-02

Summary
This paper introduces a new benchmark called SIFo that tests how well large language models (LLMs) can follow multiple instructions given in sequence, an important skill for these AI systems.
What's the problem?
The main problem is that evaluating LLMs on their ability to follow multiple instructions is difficult. Existing setups suffer from limited coherence between the instructions, from positional bias (the order in which instructions appear affects how well the model performs), and from a lack of tasks whose completion can be objectively and easily verified.
What's the solution?
To tackle these challenges, the researchers created the SIFo benchmark. In SIFo, the instructions build on one another, so whether all of them were followed can be determined by checking only the response to the final instruction. The benchmark assesses instruction following through four tasks: text modification, question answering, mathematics, and security rule following. By testing popular closed-source and open-source LLMs on this benchmark, the authors found that newer and larger models perform significantly better than their older and smaller counterparts, although all models still have difficulty following sequences of instructions.
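To make the verification idea concrete, here is a minimal Python sketch (not taken from the paper) of a sequential text-modification check in the spirit of SIFo: the model is given the full chain of instructions, but correctness is judged from the final output alone. The call_model function is a placeholder assumption; any LLM API could be plugged in there.

# Minimal sketch of sequential-instruction evaluation in the spirit of SIFo.
# call_model is a placeholder (assumption): substitute a real LLM call here.

def call_model(prompt: str) -> str:
    """Dummy model answer used only to make this example runnable."""
    return "THE CBT SBT ON THE MBT"

def evaluate_sequential_task(source_text: str, instructions: list[str], expected_final: str) -> bool:
    # Present the source text and the full chain of instructions in one prompt.
    prompt = (
        f"Text: {source_text}\n"
        "Apply the following instructions in order and output only the final result:\n"
        + "\n".join(f"{i + 1}. {inst}" for i, inst in enumerate(instructions))
    )
    answer = call_model(prompt).strip()
    # Correctness is decided by the final instruction's outcome alone:
    # the last step can only be right if every earlier step was carried out.
    return answer == expected_final

# Example: two chained text-modification instructions.
source = "the cat sat on the mat"
steps = ["Replace every 'a' with 'b'.", "Convert the text to uppercase."]
expected = source.replace("a", "b").upper()  # "THE CBT SBT ON THE MBT"
print(evaluate_sequential_task(source, steps, expected))  # True for this dummy model

Because each later instruction can only be completed correctly if the earlier ones were followed, checking the final answer is enough to score the whole sequence.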
Why it matters?
This research is important because it shows the extent to which current LLMs can handle multi-step instructions and reveals clear weaknesses in their performance. Understanding these limitations helps researchers improve AI models, making them more reliable and effective in real-world applications.
Abstract
Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness. All models struggle with following sequences of instructions, hinting at an important lack of robustness of today's language models.