Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego
2025-01-20

Summary
This paper examines how artificial intelligence models called Large Language Models (LLMs) become more confident in their answers to multiple-choice questions when they explain their reasoning first, even if their answers are wrong. The researchers tested this on seven different AI models using questions covering a wide range of topics.
What's the problem?
When evaluating how well AI models perform, researchers often use multiple-choice questions and look at how confident the AI is in its answers. However, it wasn't clear whether the way we ask the AI to answer these questions affects that confidence. The open question was whether the AI's confidence changes when it simply picks an answer versus when it explains its thinking before answering.
What's the solution?
The researchers conducted a study where they asked AI models to answer multiple-choice questions in two different ways. In one method, the AI just picked an answer. In the other method, the AI had to explain its reasoning before giving its answer. They then compared how confident the AI was in its answers under both methods, repeating the comparison across seven different AI models and questions on many different topics to get a comprehensive view.
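To make the two ways of asking concrete, here is a minimal sketch of how such a comparison could be set up. It assumes the OpenAI Python client as the backend and uses a toy question; the model name, prompt wording, and letter-extraction heuristic are illustrative assumptions, not the authors' actual setup (the paper evaluates seven different models).

```python
import math
from openai import OpenAI  # assumed backend for illustration only

client = OpenAI()          # requires OPENAI_API_KEY in the environment
MODEL = "gpt-4o-mini"      # placeholder model name, not one from the paper

QUESTION = (
    "Which planet is known as the Red Planet?\n"
    "A) Venus\nB) Mars\nC) Jupiter\nD) Saturn\n"
)

def answer_confidence(prompt: str) -> tuple[str, float]:
    """Return the option letter the model picks and the probability it assigned to it."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,     # ask for per-token log-probabilities
        max_tokens=300,
        temperature=0,
    )
    tokens = resp.choices[0].logprobs.content
    # Scan the generated tokens from the end for the option letter: the bare
    # reply in the direct case, or the letter after "Answer:" in the
    # chain-of-thought case.
    for tok in reversed(tokens):
        letter = tok.token.strip(" \n\t().:*")
        if letter in {"A", "B", "C", "D"}:
            return letter, math.exp(tok.logprob)
    raise ValueError("no option letter found in the model output")

# Method 1: answer directly with the selected option.
direct = QUESTION + "Reply with a single letter: A, B, C or D."
# Method 2: reason first, then answer (chain of thought).
cot = (QUESTION + "Think step by step, then end your reply with "
       "'Answer: X' where X is A, B, C or D.")

for name, prompt in [("direct", direct), ("chain of thought", cot)]:
    letter, confidence = answer_confidence(prompt)
    print(f"{name}: selected {letter} with confidence {confidence:.3f}")
```

The key design point is that "confidence" here is simply the probability the model assigned to the option letter it generated, which is the kind of LLM-estimated probability the paper analyzes.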
Why it matters?
This research matters because it shows that the way we ask an AI to answer questions can affect how confident it is in its answers, even when those answers are wrong. This is important for a few reasons. First, it helps us understand how to better evaluate AI systems, which is crucial as we rely on them more in our daily lives. Second, it reveals a similarity between AI and humans: people also tend to be more confident in an answer after they explain it. Lastly, it highlights that we need to be careful when using an AI's confidence as a measure of its accuracy, as high confidence doesn't always mean the answer is correct. This knowledge can help us build more reliable AI systems and use them more effectively in the future.
Abstract
One of the most widely used methods to evaluate LLMs is Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples, known as few-shot examples, can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM's confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning modifying the probability of the selected answer, as the LLM predicts the answer based on both the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
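To make the comparison described in the abstract concrete, the sketch below shows one way such measurements could be aggregated: mean confidence per prompting method, split by whether the selected answer was correct. The records are dummy placeholders for illustration only, not values from the paper.

```python
from statistics import mean

# Dummy placeholder records (not the paper's data). In a real evaluation
# there would be one record per question and prompting method.
results = [
    {"method": "direct", "correct": True,  "confidence": 0.70},
    {"method": "direct", "correct": False, "confidence": 0.52},
    {"method": "cot",    "correct": True,  "confidence": 0.93},
    {"method": "cot",    "correct": False, "confidence": 0.86},
]

for method in ("direct", "cot"):
    for is_correct in (True, False):
        confs = [r["confidence"] for r in results
                 if r["method"] == method and r["correct"] == is_correct]
        if confs:
            label = "correct" if is_correct else "incorrect"
            print(f"{method:>6} / {label:<9}: mean confidence {mean(confs):.2f}")
```

The paper's finding corresponds to the chain-of-thought rows showing higher mean confidence than the direct rows for both correct and incorrect answers.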