
Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson

2025-04-21

Summary

This paper talks about how we measure uncertainty in language models, that is, how sure an AI is about its answers. It reveals that the usual ways of evaluating this can be unfair because of hidden biases, especially biases related to how long the answers are.

What's the problem?

The problem is that when researchers test how good language models are at knowing when they might be wrong, they rely on automatic correctness checks to decide whether each answer was right. Many of these checks are biased by answer length, and many uncertainty scores also depend on length, so the two biases interact: a method can look better or worse than it really is simply because it favors shorter or longer answers. This gives a misleading picture of how well the AI actually understands its own uncertainty.
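
To make this concrete, here is a minimal synthetic sketch (not from the paper; all data and numbers are invented) of how a shared length bias can inflate AUROC, the metric commonly used to score uncertainty quantification methods:

```python
# Synthetic demonstration: a length-biased correctness check plus a
# length-tracking uncertainty score produce a spuriously high AUROC.
# Requires numpy and scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000

# Response lengths in tokens.
length = rng.integers(5, 200, size=n)

# A length-biased "correctness function": longer answers are marked wrong
# more often, regardless of their actual quality.
p_correct = 1.0 / (1.0 + np.exp((length - 100) / 30.0))
correct = rng.random(n) < p_correct

# An uncertainty score that knows nothing about correctness and merely
# tracks response length, plus noise.
uncertainty = length + rng.normal(0.0, 20.0, size=n)

# Higher uncertainty should predict incorrectness. Because both the score
# and the labels depend on length, AUROC lands well above the 0.5 chance
# level even though the score carries no real signal about correctness.
print("AUROC:", roc_auc_score(~correct, uncertainty))
```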

What's the solution?

The researchers took a closer look at the correctness checks used in these evaluations and found that using a language model itself as the judge of whether an answer is correct (LLM-as-a-judge) is less length-biased than other common automatic checks. This means it gives a more honest view of when the AI is truly unsure about its answers.
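
As a rough illustration, an LLM-as-a-judge correctness check might look like the sketch below. This is a hypothetical example, not the paper's code; `ask_llm` stands in for any text-in, text-out call to a judge model. The idea is that judging meaning rather than surface overlap makes the verdict less sensitive to how long the response is.

```python
# Hypothetical sketch of an LLM-as-a-judge correctness function; the names
# and prompt are invented for illustration, not taken from the paper.
def judge_correct(question: str, reference: str, response: str, ask_llm) -> bool:
    """Return True if a judge LLM deems `response` a correct answer.

    `ask_llm` is any callable that sends a prompt string to a language
    model and returns its text reply.
    """
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {response}\n"
        "Does the model answer convey the same meaning as the reference, "
        "regardless of its length or wording? Reply with YES or NO only."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```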

Why it matters?

This matters because if we want to trust AI to tell us when it might be making a mistake, especially in important situations, we need to make sure we're measuring its uncertainty in a fair and accurate way. This helps make AI safer and more dependable.

Abstract

Evaluations of Uncertainty Quantification (UQ) in language models are biased by the correctness functions they rely on, distorting the measured performance of UQ methods, with LLM-as-a-judge approaches identified as less biased.