When an LLM is apprehensive about its answers -- and when its uncertainty is justified
Petr Sychev, Andrey Goncharov, Daniil Vyazhev, Edvard Khalafyan, Alexey Zaytsev
2025-03-04
Summary
This paper examines how to measure uncertainty in large language models (LLMs) when they answer multiple-choice questions. The researchers compared different ways of telling when a model might be unsure of its answers and tested whether these methods work equally well across different types of questions.
What's the problem?
It's important to know when AI models might be giving incorrect answers, especially when those answers feed into important decisions. Current methods for measuring uncertainty each focus on a specific type of uncertainty and might not work well for all kinds of questions.
What's the solution?
The researchers tested two main ways of measuring uncertainty: looking at how confident the model is in each token of its answer (token-wise entropy) and asking a model to judge its own answers (model-as-judge). They ran these methods on three LLM families across fourteen topics, including biology and math. Token-wise entropy predicted errors well on knowledge-based questions but not on questions that require reasoning, while model-as-judge performed no better than a random predictor.
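To make the first method concrete, here is a minimal sketch of token-wise entropy: given the model's probability distribution over the next token (here, hypothetical distributions over four answer options), the Shannon entropy is low when the model concentrates on one option and high when it spreads probability evenly. The distributions below are made up for illustration, not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token distributions over answer options A-D
confident = [0.94, 0.02, 0.02, 0.02]   # model strongly favours one option
uncertain = [0.25, 0.25, 0.25, 0.25]   # model has no preference

print(token_entropy(confident))  # low entropy: model is "sure"
print(token_entropy(uncertain))  # maximal entropy, log(4): model is "unsure"
```

A high entropy on the answer token is then read as a signal that the model may be guessing.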
Why it matters?
This matters because understanding when AI is uncertain helps us trust its answers more appropriately. It shows that we need different ways to measure uncertainty for different types of questions. This research could help make AI systems safer and more reliable, especially when they're used for important decisions in fields like medicine or finance.
Abstract
Uncertainty estimation is crucial for evaluating Large Language Models (LLMs), particularly in high-stakes domains where incorrect answers result in significant consequences. Numerous approaches address this problem while focusing on one specific type of uncertainty and ignoring others. We investigate which estimates, specifically token-wise entropy and model-as-judge (MASJ), work for multiple-choice question-answering tasks across different question topics. Our experiments consider three LLMs -- Phi-4, Mistral, and Qwen -- in sizes from 1.5B to 72B parameters, and 14 topics. While MASJ performs similarly to a random error predictor, response entropy predicts model error in knowledge-dependent domains and serves as an effective indicator of question difficulty: for biology the ROC AUC is 0.73. This correlation vanishes in the reasoning-dependent domain: for math questions the ROC AUC is 0.55. More fundamentally, we found that the effectiveness of the entropy measure depends on the amount of reasoning a question requires. Thus, entropy related to data uncertainty should be integrated within uncertainty-estimation frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro samples are biased: the amount of reasoning required should be balanced across subdomains to provide a fairer assessment of LLM performance.
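The ROC AUC figures quoted above measure how well an uncertainty score separates errors from correct answers. As a sketch under synthetic data (the scores and labels below are invented, not the paper's results), the AUC can be computed with the rank-based Mann-Whitney formulation: the probability that a randomly chosen error receives a higher uncertainty score than a randomly chosen correct answer.

```python
def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney U formulation: the probability that a
    randomly chosen positive (model error, label 1) has a higher score than
    a randomly chosen negative (correct answer, label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic example: entropy scores and whether the model erred (1 = error)
entropies = [0.1, 0.2, 0.9, 1.2, 0.3, 1.1]
errors    = [0,   0,   1,   1,   0,   1  ]
print(roc_auc(entropies, errors))  # 1.0: entropy perfectly ranks errors first
```

An AUC near 0.5 (as reported for math) means the score ranks errors no better than chance, while values like 0.73 (biology) indicate useful separation.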