Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

William Jurayj, Jeffrey Cheng, Benjamin Van Durme

2025-02-20

Summary

This paper talks about improving how AI language models answer questions by giving them more time to think and letting them decide when they're confident enough to answer. It's like teaching a student not just to answer every question, but to know when they're sure about their answer and when it's better to say 'I'm not sure.'

What's the problem?

Current ways of testing AI models assume they should always give an answer, even if they're not confident. This isn't realistic and can lead to wrong or inappropriate answers. It's like forcing a student to guess on every question in a test, even when they have no idea what the answer is.

What's the solution?

The researchers extract a confidence score while the AI reasons through a question, then use that score to decide whether the model should answer at all. They found that giving the AI more compute to think not only helped it answer more questions correctly, but also made it more confident in its correct answers. They also propose new ways of evaluating AI that allow for some level of risk in answering, which better matches real-world situations.
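The core idea of answering only when confident can be sketched in a few lines. This is a minimal illustration, not the paper's actual method: the confidence values and the threshold here are made-up placeholders, and how the paper extracts confidence during reasoning is not shown.

```python
# Selective question answering: answer only when confidence
# clears a threshold, otherwise abstain.
# The threshold and confidence values are illustrative placeholders.

def select_answer(answer: str, confidence: float, threshold: float = 0.8) -> str:
    """Return the answer if confidence >= threshold, else abstain."""
    if confidence >= threshold:
        return answer
    return "I'm not sure"

# Two hypothetical model responses with placeholder confidence scores.
print(select_answer("Paris", 0.95))  # high confidence -> answers
print(select_answer("Lyon", 0.40))   # low confidence -> abstains
```

Raising the threshold trades coverage (how many questions get answered) for reliability (how often answered questions are correct).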

Why it matters?

This matters because it could make AI systems more reliable and trustworthy in real-world applications. By knowing when they're confident and when they're not, AI assistants could give more accurate information and avoid spreading misinformation. It's a step towards creating AI that knows its own limitations, which is crucial for using AI safely and effectively in important areas like healthcare, education, or decision-making.

Abstract

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
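One way to make the "non-zero levels of response risk" setting concrete is a scoring rule where a correct answer earns a point, a wrong answer costs a risk-dependent penalty, and an abstention costs nothing. This exact rule is an assumption for illustration; the paper's own evaluation recipe may differ.

```python
# Hypothetical risk-adjusted scoring for selective QA evaluation:
# +1 per correct answer, -risk per incorrect answer, 0 per abstention.
# The scoring rule is an illustrative assumption, not the paper's
# exact recipe.

def risk_score(outcomes: list[str], risk: float) -> float:
    """Average score over outcomes ('correct' / 'incorrect' / 'abstain')."""
    total = 0.0
    for outcome in outcomes:
        if outcome == "correct":
            total += 1.0
        elif outcome == "incorrect":
            total -= risk
    return total / len(outcomes)

outcomes = ["correct", "correct", "abstain", "incorrect"]
print(risk_score(outcomes, risk=0.0))  # zero-risk setting: 0.5
print(risk_score(outcomes, risk=1.0))  # non-zero risk: 0.25
```

At zero risk, abstaining is never better than guessing, which is the assumption the standard evaluation paradigm makes; at non-zero risk, a model that knows when to abstain can score higher than one that always answers.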