A Single Character can Make or Break Your LLM Evals
Jingtong Su, Jianyu Zhang, Karen Ullrich, Léon Bottou, Mark Ibrahim
2025-10-09
Summary
This paper investigates how the seemingly trivial formatting of the examples given to large language models (LLMs), such as whether commas, newlines, or hashtags separate them, surprisingly impacts the quality of the model's answers.
What's the problem?
LLMs are often given examples to show them *how* to answer a question, and researchers have studied *how many* examples to use. However, no one had really studied whether the *formatting* of those examples mattered. The problem is that this formatting choice, which seems minor, can actually cause big differences in how well the model performs, and can even change which model appears to be the 'best'.
What's the solution?
The researchers tested different ways to separate examples (different 'delimiters') across several popular LLMs like Llama, Qwen, and Gemma. They found that performance on tests like MMLU could change by as much as 23% just by switching from a comma to a newline! They also looked inside the model to see *why* this happens, discovering that good delimiters help the model focus on the important parts of the input. Finally, they figured out ways to make the models less sensitive to the delimiter used, like specifically mentioning the delimiter in the prompt itself.
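To make the delimiter choice concrete, here is a minimal sketch of assembling the same few-shot prompt with different separators. The prompt template and toy examples are illustrative assumptions, not the paper's exact evaluation setup:

```python
def build_prompt(examples, question, delimiter):
    # Join in-context examples with the chosen delimiter, then append
    # the target question in the same format.
    shots = delimiter.join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}{delimiter}Q: {question}\nA:"

examples = [("2 + 2 = ?", "4"), ("3 * 3 = ?", "9")]
question = "5 - 1 = ?"

# The only difference between these prompts is the separator between
# examples -- the single-character change the paper finds can swing
# benchmark accuracy by up to 23%.
comma_prompt = build_prompt(examples, question, ", ")
newline_prompt = build_prompt(examples, question, "\n\n")
hashtag_prompt = build_prompt(examples, question, " # ")
```

Each variant carries identical content; only the surface formatting differs, which is why the reported accuracy swings are so surprising.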
Why it matters?
This research is important because it shows that LLMs are surprisingly fragile and easily influenced by superficial things like formatting. This means that when evaluating or using these models, we need to be careful about how we present examples. It also suggests that making LLMs more robust to these kinds of changes is an important area for future research, and provides practical advice on which delimiters work best.
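One mitigation the authors report is naming the chosen delimiter in the prompt itself. A hedged sketch of what such an instruction might look like (the header wording and helper name are illustrative, not the paper's exact phrasing):

```python
def build_robust_prompt(examples, question, delimiter, delimiter_name):
    # Prepend an instruction that explicitly names the separator used
    # between the in-context examples; the paper finds that stating the
    # delimiter in the prompt improves robustness to its choice.
    header = f"The examples below are separated by {delimiter_name}.\n"
    shots = delimiter.join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{header}{shots}{delimiter}Q: {question}\nA:"

prompt = build_robust_prompt(
    [("2 + 2 = ?", "4")], "5 - 1 = ?", " # ", "a hashtag"
)
```

The idea is simply to remove ambiguity: the model no longer has to infer from context how the examples are segmented.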
Abstract
Common Large Language Model (LLM) evaluations rely on demonstration examples to steer models' responses to the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real world usage, users face the choice of how to separate in-context examples: use a comma? new line? semi-colon? hashtag? etc.? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU for example can vary by ±23% depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by only modifying the single character separating examples. We find LLMs' brittleness pervades topics, model families, and doesn't improve with scale. By probing attention head scores, we find that good-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find specifying the selected delimiter in the prompt boosts robustness and offer practical recommendations for the best-performing delimiters to select.