Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang
2025-09-05
Summary
This paper investigates a weakness in large language models (LLMs) – their tendency to stick with what they’ve already learned, even when told to do something different. It introduces a new way to test this, called Inverse IFEval, and shows that even the best LLMs struggle with it.
What's the problem?
LLMs are really good at many tasks, but they can be inflexible. They’ve been trained on tons of data, and they often have trouble when instructions go against the patterns they’ve already picked up. Imagine a model trained to always answer questions directly; it might struggle if you ask it to intentionally give a slightly wrong answer. This 'cognitive inertia' limits their usefulness in situations requiring adaptability or thinking outside the box.
What's the solution?
The researchers created a benchmark called Inverse IFEval to specifically measure this problem. It includes challenges like asking the model to correct a question that's already correct, intentionally including errors in text and asking the model to use it, requesting code without comments or explanations, and asking questions based on false premises. Using a human-in-the-loop pipeline, they built a dataset of 1,012 high-quality questions in Chinese and English spanning 23 domains, then tested several leading LLMs on it, with an optimized LLM-as-a-Judge framework scoring the responses.
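To make the setup concrete, here is a minimal sketch of how one such counter-intuitive test item could be represented and pre-checked programmatically, using the "Code without Comments" category as an example. All names here (`InverseItem`, `no_comments`) are hypothetical illustrations, not the paper's actual implementation; the real benchmark relies on an LLM-as-a-Judge framework rather than simple heuristics.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InverseItem:
    """One hypothetical Inverse-IFEval-style test item (illustrative only)."""
    category: str                  # e.g. "Code without Comments"
    prompt: str                    # the counter-intuitive instruction
    check: Callable[[str], bool]   # cheap heuristic filter before LLM judging

def no_comments(response: str) -> bool:
    """Pass only if the response contains no Python comments or docstrings."""
    return not any(
        line.lstrip().startswith(("#", '"""', "'''"))
        for line in response.splitlines()
    )

item = InverseItem(
    category="Code without Comments",
    prompt=(
        "Write a Python function that reverses a string. "
        "Do NOT include any comments or docstrings."
    ),
    check=no_comments,
)

# A compliant answer passes; a habit-driven answer with a docstring fails,
# illustrating the "cognitive inertia" the benchmark targets.
compliant = "def rev(s):\n    return s[::-1]"
habitual = 'def rev(s):\n    """Reverse s."""\n    return s[::-1]'
print(item.check(compliant), item.check(habitual))  # True False
```

A heuristic like this can only screen obvious violations; judging whether a response genuinely follows an adversarial instruction in general is exactly why the authors pair the benchmark with an LLM judge.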
Why it matters?
This research highlights that simply making LLMs fluent and factually correct isn’t enough. We also need to make them adaptable and able to overcome their initial training biases. Inverse IFEval provides a tool to diagnose this weakness and guide future development towards more reliable and flexible LLMs that can handle unexpected or unconventional requests in real-world scenarios.
Abstract
Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.