
Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences

Vaishnavi Shrivastava, Ananya Kumar, Percy Liang

2025-02-04


Summary

This paper introduces a new way to measure how confident AI language models are in their answers. Instead of asking the AI to rate its confidence on a numeric scale, the researchers propose comparing pairs of questions to see which one the AI feels more sure about answering correctly.

What's the problem?

Current methods of asking AI models to rate their own confidence don't work well. The AI often gives similarly high scores for most questions, which doesn't help users know when the AI might be wrong or when they should double-check its answers with a human expert.

What's the solution?

The researchers propose a method called relative confidence estimation. It asks the AI to compare two questions and decide which one it's more confident about answering correctly. By doing this many times with different pairs of questions, they can use rank aggregation methods such as Elo rating and Bradley-Terry (the kind of systems used for chess rankings) to turn these pairwise preferences into a confidence score for each question. They tested this new method on five advanced AI models across 14 different types of challenging questions.
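To make the idea concrete, here is a minimal sketch of how pairwise confidence preferences can be turned into per-question scores with Elo rating. The `prefer(a, b)` callback is a placeholder standing in for asking the model "Which question are you more confident in answering correctly?"; the constants (initial rating, K-factor, number of matchups) are illustrative assumptions, not the paper's exact setup.

```python
import random


def elo_confidence_scores(questions, prefer, k=32.0, rounds=1000, seed=0):
    """Aggregate pairwise confidence preferences into Elo-style scores.

    questions: list of question identifiers (the "players").
    prefer(a, b): returns True if the model is more confident about
        answering `a` than `b` (a stand-in for prompting the LM).
    Returns a dict mapping each question to its final rating; higher
    ratings mean the model is relatively more confident.
    """
    rng = random.Random(seed)
    ratings = {q: 1000.0 for q in questions}  # everyone starts equal

    for _ in range(rounds):
        a, b = rng.sample(questions, 2)  # pick a random matchup
        # Standard Elo expected score for `a` against `b`.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        outcome_a = 1.0 if prefer(a, b) else 0.0
        # Move each rating toward the observed match outcome.
        ratings[a] += k * (outcome_a - expected_a)
        ratings[b] += k * ((1.0 - outcome_a) - (1.0 - expected_a))

    return ratings
```

With a perfectly transitive preference function, the resulting ratings recover the underlying confidence ordering; in practice the model's preferences may be noisy or intransitive, which is exactly why an aggregation method like Elo or Bradley-Terry is needed rather than a single comparison.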

Why it matters?

This matters because as AI becomes more common in our daily lives, we need to know when we can trust its answers and when we should be cautious. The new method produces more reliable confidence scores, which could make AI safer and more useful in important areas like healthcare, education, and scientific research. By improving selective classification AUC by 3.5% on average over direct absolute confidence methods, this research helps us better judge when AI is likely to be right or wrong.

Abstract

Language models (LMs) should provide reliable confidence estimates to help users detect mistakes in their outputs and defer to human experts when necessary. Asking a language model to assess its confidence ("Score your confidence from 0-1.") is a natural way of evaluating its uncertainty. However, models struggle to provide absolute assessments of confidence (i.e. judging confidence in answering a question independent of other questions) and the coarse-grained scores they produce are not useful for evaluating the correctness of their answers. We propose relative confidence estimation, where we match up questions against each other and ask the model to make relative judgments of confidence ("Which question are you more confident in answering correctly?"). Treating each question as a "player" in a series of matchups against other questions and the model's preferences as match outcomes, we can use rank aggregation methods like Elo rating and Bradley-Terry to translate the model's confidence preferences into confidence scores. We evaluate relative confidence estimation against absolute confidence estimation and self-consistency confidence methods on five state-of-the-art LMs -- GPT-4, GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1 405B -- across 14 challenging STEM, social science, and commonsense reasoning question answering tasks. Our results demonstrate that relative confidence estimation consistently provides more reliable confidence scores than absolute confidence estimation, with average gains of 3.5% in selective classification AUC over direct absolute confidence estimation methods and 1.7% over self-consistency approaches across all models and datasets.