
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab

2025-10-17


Summary

This paper investigates how well large language models, specifically those used in retrieval-augmented generation (RAG) systems that fetch supporting documents to help answer questions, can correctly *avoid* answering when the information they're given is bad or misleading.

What's the problem?

Currently, even the most advanced language models often struggle to recognize and refuse to answer questions when the context they receive is flawed. They either confidently give wrong answers based on bad info, or they become overly cautious and refuse to answer even when they could provide a correct response. Existing tests for this ability aren't very reliable because models can 'cheat' by memorizing the test questions or finding patterns specific to the test data, rather than actually understanding when to refuse. Basically, we don't have a good way to reliably check if these models are safe and won't spread misinformation.

What's the solution?

The researchers created a new method called RefusalBench to automatically generate challenging test cases. This method intentionally introduces different types of uncertainty and errors into the information given to the model, varying both how much and what kind of 'bad' information is included. They tested over 30 different models using these new tests and found that the ability to refuse to answer actually involves two separate skills – first, detecting that something is wrong, and second, categorizing *why* it's wrong. They also discovered that simply making the models bigger or giving them more reasoning ability doesn't automatically fix this problem. Importantly, they showed that this refusal skill can be improved through training, suggesting it's something models can learn.
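To make the generation idea concrete, here is a minimal sketch of what perturbation-based test-case creation could look like. The category names, intensity levels, and function names below are illustrative assumptions, not the paper's actual 176 strategies or taxonomy; the point is simply that a passage is programmatically corrupted so that refusal becomes the correct behavior.

```python
# Hypothetical sketch of RefusalBench-style test-case generation.
# Each perturbation corrupts a grounding passage so the question is
# no longer reliably answerable from it. Names here are assumptions.

PERTURBATIONS = {
    # inject a contradiction so the context disagrees with itself
    "contradiction": lambda text: text + " However, other sources state the opposite.",
    # drop the final sentence so key information is missing
    "omission": lambda text: text.rsplit(".", 2)[0] + ".",
    # replace a precise fact with a vague one
    "vagueness": lambda text: text.replace("1969", "sometime in the past century"),
}

def make_refusal_case(question, passage, category, intensity="low"):
    """Perturb a grounding passage so the model should refuse to answer."""
    perturbed = PERTURBATIONS[category](passage)
    if intensity == "high":
        # at higher intensity, apply the perturbation more aggressively
        perturbed = PERTURBATIONS[category](perturbed)
    return {
        "question": question,
        "context": perturbed,
        "expected_behavior": "refuse",  # the answer is no longer supported
        "category": category,
        "intensity": intensity,
    }

case = make_refusal_case(
    "When did Apollo 11 land on the Moon?",
    "Apollo 11 landed on the Moon in 1969. The mission was led by Neil Armstrong.",
    category="vagueness",
)
print(case["context"])
```

A model is then scored on whether it refuses (and, for categorization, on whether it names the right kind of flaw) rather than answering from the corrupted context.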

Why it matters?

This research is important because it highlights a significant safety issue with language models. If these models can't reliably refuse to answer based on flawed information, they could easily spread misinformation or provide harmful advice. By creating a better way to test and improve this ability, the researchers are helping to make these powerful AI systems more trustworthy and safe for everyone. They've also provided tools for others to continue evaluating and improving these models in the future.

Abstract

The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks -- RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) -- and our complete generation framework to enable continued, dynamic evaluation of this critical capability.