Mitigating Label Length Bias in Large Language Models
Mario Sanz-Guerrero, Katharina von der Wense
2025-11-19
Summary
This paper investigates a problem with how large language models (LLMs) make predictions when choosing from a list of options, specifically focusing on how the length of the answer choices affects their accuracy.
What's the problem?
LLMs are strong few-shot learners, but when asked to pick the best answer from a set of choices, they tend to favor longer answers. This isn't because longer answers are actually better, but because of a built-in bias in how the model scores text. Existing methods for fixing label biases don't account for answer length, so the models still aren't comparing options fairly, even after standard length normalization.
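To see where the bias comes from, note that a model's score for a multi-token answer is the product of its per-token probabilities, so every extra token drags the raw score down. The toy numbers below are made up purely for illustration; they show a longer label losing under raw scoring even though its individual tokens are, on average, more probable, and winning once a per-token (geometric-mean) normalization is applied.

```python
import math

# Hypothetical per-token probabilities a model might assign to each
# candidate label's tokens (numbers invented for illustration).
label_token_probs = {
    "yes": [0.30],                          # single-token label
    "not applicable": [0.25, 0.45, 0.40],   # three-token label
}

# Raw score: joint probability = product of token probabilities.
raw = {label: math.prod(p) for label, p in label_token_probs.items()}

# Length-normalized score: geometric mean of token probabilities.
norm = {label: math.prod(p) ** (1 / len(p)) for label, p in label_token_probs.items()}
```

Here `raw["yes"]` is 0.30 while `raw["not applicable"]` is only 0.045, so raw scoring picks the short label; the geometric mean reverses that ranking. The paper's point is that even this kind of standard normalization leaves residual inconsistencies between labels of different lengths.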
What's the solution?
The researchers developed a new technique called normalized contextual calibration (NCC). The method scores each complete answer choice, normalizes that score for the number of tokens the answer contains, and then calibrates the model's confidence so that no label is favored simply because of its length. Tested across several datasets and LLMs, NCC consistently improved performance.
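A minimal sketch of the idea, under the assumption (based on the method's name and the abstract) that NCC combines length normalization of the full label's log-probability with contextual calibration, i.e. subtracting the same normalized score computed from a content-free prompt. The `token_logprobs` function and all its numbers are hypothetical stand-ins for a real language-model scoring call, not the paper's implementation.

```python
import math

def token_logprobs(prompt: str, label: str) -> list[float]:
    """Hypothetical LM call: log-prob of each token of `label` given `prompt`.
    Hard-coded toy values stand in for a real model."""
    fake = {
        ("Q: Is the sky green? A:", "yes"): [math.log(0.20)],
        ("Q: Is the sky green? A:", "not applicable"):
            [math.log(0.05), math.log(0.50), math.log(0.60)],
        ("N/A A:", "yes"): [math.log(0.40)],
        ("N/A A:", "not applicable"):
            [math.log(0.02), math.log(0.50), math.log(0.60)],
    }
    return fake[(prompt, label)]

def ncc_score(prompt: str, label: str, content_free: str = "N/A A:") -> float:
    # Length-normalized log-probability of the full label given the real prompt.
    lp = token_logprobs(prompt, label)
    norm = sum(lp) / len(lp)
    # Calibrate by subtracting the same quantity under a content-free prompt,
    # removing the model's context-independent preference for this label.
    cf = token_logprobs(content_free, label)
    return norm - sum(cf) / len(cf)

labels = ["yes", "not applicable"]
scores = {lbl: ncc_score("Q: Is the sky green? A:", lbl) for lbl in labels}
best = max(scores, key=scores.get)
```

In this toy setup the content-free prompt reveals that the model likes "yes" regardless of the question, so after calibration the multi-token label "not applicable" wins despite its lower raw probability.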
Why does it matter?
This work shows that standard length normalization alone isn't enough to eliminate label biases in LLMs. Addressing biases in how models handle multi-word answers is crucial for making these models more reliable and accurate, especially in real-world applications where the correct answer isn't always the longest or shortest option. It also means we can get better results with fewer example questions when teaching the model in context.
Abstract
Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.