AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, Samuel J. Bell

2025-06-16

Summary

This paper introduces AbstentionBench, a new benchmark designed to test how well large language models (LLMs) know when not to answer questions they aren't sure about or that can't be answered. It examines how models handle tricky questions where the answer may be missing, ambiguous, or impossible to know, across many domains including math, science, and real-world scenarios.

What's the problem?

The problem is that even though LLMs are good at answering many questions, they often fail to recognize when a question has no clear answer or rests on wrong or incomplete information. Instead of saying they don't know or can't answer, the models may give a confident but incorrect response. Surprisingly, when models are fine-tuned specifically to reason better, they sometimes get worse at knowing when to abstain.

What's the solution?

The solution was to build AbstentionBench by collecting and modifying many datasets containing a variety of unanswerable or uncertain questions. The authors created an automatic judging system to decide whether a model abstained correctly. They evaluated many top LLMs, measuring how often each correctly chose to abstain rather than guess, and found that none performed well at this task. They also showed that fine-tuning models for stronger reasoning doesn't fix the problem and can even make it worse, though adding special instruction prompts sometimes helps a little.
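To make the evaluation idea concrete, here is a minimal illustrative sketch of how one might score abstention on a set of labeled questions. This is a toy phrase-matching detector invented for this example, not the paper's actual LLM-based judge; the phrase list and function names are assumptions.

```python
# Toy sketch of abstention scoring (NOT the paper's actual judge,
# which uses an LLM to decide whether a response abstains).

# Hypothetical phrases that signal the model declined to answer.
ABSTENTION_PHRASES = [
    "i don't know",
    "cannot be determined",
    "not enough information",
    "unanswerable",
]

def is_abstention(response: str) -> bool:
    """Heuristic check: does the response signal abstention?"""
    text = response.lower()
    return any(phrase in text for phrase in ABSTENTION_PHRASES)

def abstention_recall(responses, should_abstain):
    """Fraction of unanswerable questions where the model abstained."""
    hits = total = 0
    for resp, label in zip(responses, should_abstain):
        if label:  # this question is unanswerable, so abstaining is correct
            total += 1
            hits += is_abstention(resp)
    return hits / total if total else 0.0
```

A real benchmark judge must be far more robust than keyword matching, since models abstain in many phrasings; that is precisely why AbstentionBench uses an automatic judging system rather than a fixed phrase list.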

Why it matters?

This matters because in real life, it’s just as important for AI to know when not to answer as to answer correctly. If AI confidently gives wrong answers to unanswerable questions, it can cause serious problems, especially in fields like medicine or law. By showing that current models struggle with this and providing a way to measure and improve it, the paper helps guide future research to make AI safer and more reliable.

Abstract

AbstentionBench evaluates the ability of LLMs to abstain from answering uncertain or unanswerable questions, revealing that reasoning fine-tuning often degrades this capability.