
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero

2025-09-08


Summary

This paper investigates how well large language models, or LLMs, perform when questions are not worded exactly as they appear in standard tests. It asks whether current benchmarks accurately reflect a model's true abilities when faced with the real-world variety in how people phrase things.

What's the problem?

Currently, LLMs are tested using benchmarks whose questions are always worded the same way. In real life, however, people phrase questions differently – they paraphrase. The problem is that we don't know whether LLMs can handle these different ways of asking the same question, or whether the high scores they earn on benchmarks actually mean they're reliable in practical situations. Essentially: are these models good at understanding the *meaning* of a question, or just at recognizing specific wording?

What's the solution?

The researchers took six common LLM benchmarks and created many different paraphrases of each question. Then, they tested 34 different LLMs on both the original questions *and* the paraphrased versions. By comparing the results, they could see how much the models' performance changed when the questions were reworded. This allowed them to measure how robust the models were to linguistic variation.
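The comparison described above boils down to two numbers per experiment: how much each model's score drops under paraphrasing, and how well the model *ranking* is preserved. A minimal sketch of that comparison is below; the model names, toy answers, and gold labels are entirely illustrative, not from the paper, and the rank-correlation helper assumes no tied scores.

```python
def accuracy(answers, gold):
    """Fraction of questions answered correctly."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    n = len(xs)
    order = lambda v: sorted(range(n), key=lambda i: v[i], reverse=True)
    rx = {i: r for r, i in enumerate(order(xs))}
    ry = {i: r for r, i in enumerate(order(ys))}
    d2 = sum((rx[i] - ry[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy data: three hypothetical models answering four multiple-choice
# questions, once in the original wording and once paraphrased.
gold = ["B", "A", "C", "D"]
original = {
    "model_x": ["B", "A", "C", "D"],
    "model_y": ["B", "A", "C", "A"],
    "model_z": ["B", "A", "D", "A"],
}
paraphrased = {
    "model_x": ["B", "A", "C", "A"],
    "model_y": ["B", "A", "D", "A"],
    "model_z": ["B", "D", "D", "A"],
}

orig_scores = {m: accuracy(a, gold) for m, a in original.items()}
para_scores = {m: accuracy(a, gold) for m, a in paraphrased.items()}
models = sorted(orig_scores)

# Ranking stability across the two conditions...
rho = spearman_rho([orig_scores[m] for m in models],
                   [para_scores[m] for m in models])
# ...and the average absolute-score drop caused by paraphrasing.
mean_drop = sum(orig_scores[m] - para_scores[m] for m in models) / len(models)
```

In this toy case the ranking is perfectly preserved (`rho == 1.0`) even though every model loses 0.25 in accuracy – exactly the pattern the study reports: stable rankings, significantly lower absolute scores.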

Why it matters?

The findings show that while the *ranking* of models generally stayed the same, their actual scores dropped significantly when questions were paraphrased. This is important because it suggests that LLMs aren't as good at generalizing as we thought, and that high scores on current benchmarks might be misleading. It highlights the need for better testing methods that include variations in wording to more accurately assess how well these models will perform in the real world.

Abstract

The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks, and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs of different sizes and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change, and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model's robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.