Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
Mario Sanz-Guerrero, Minh Duc Bui, Katharina von der Wense
2025-09-19
Summary
This paper investigates how a small detail in how we prompt large language models, specifically how the space after 'Answer:' is handled, can significantly affect the results we get when evaluating them.
What's the problem?
When testing large language models on multiple-choice questions, researchers usually end the prompt with 'Answer:' so the model's next token reveals which letter it picks. However, there's no standard way to handle the space after the colon when breaking the prompt into the smaller pieces the model understands (called tokens). This paper shows that different tokenization choices can lead to surprisingly large differences in accuracy, up to 11%, and can even change which models appear to be the best.
What's the solution?
The researchers tested different ways of tokenizing the space after 'Answer:'. They found that consistently attaching the space *to* the answer letter (scoring ' A' rather than 'A' after a prompt ending in a space) worked best. This method not only improved accuracy across different models but also made the models better at estimating how confident they are in their answers, which is important for trusting their results.
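To see why the two choices can differ, here is a toy illustration (not the paper's actual code): a greedy longest-match tokenizer over a small hypothetical vocabulary, standing in for a real subword tokenizer like BPE. The full string 'Answer: A' is identical either way, but the token sequence the model scores is not.

```python
# Hypothetical subword vocabulary; real LLM vocabularies likewise often
# contain space-prefixed tokens such as ' A' alongside bare 'A'.
VOCAB = {"Answer", ":", " ", "A", " A"}

def tokenize(text):
    """Greedy longest-match tokenization (a simplified stand-in for BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize text at position {i}")
    return tokens

# Choice 1: leave the space in the prompt and score the bare letter.
seq1 = tokenize("Answer: ") + tokenize("A")
print(seq1)  # ['Answer', ':', ' ', 'A']

# Choice 2 (the strategy the paper recommends): attach the space to the letter.
seq2 = tokenize("Answer:") + tokenize(" A")
print(seq2)  # ['Answer', ':', ' A']
```

Even though both sequences decode to the same text, the model assigns next-token probabilities to different tokens in each case ('A' versus ' A'), which is the source of the accuracy gaps the paper measures.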
Why does it matter?
This research highlights that even seemingly unimportant details in how we evaluate these powerful AI models can have a big effect on the conclusions we draw. It underscores the need for clear, standardized evaluation rules so that different researchers can compare models fairly and reliably, and so that we can trust the performance numbers we see.
Abstract
When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string "Answer:" to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy -- tokenizing the space together with the answer letter -- as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model's confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.