Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin
2025-09-03
Summary
This research paper investigates whether large language models (LLMs) are truly as sensitive to how a question is worded as people previously thought, or if the problem lies in how we *measure* their performance.
What's the problem?
It's been widely believed that LLMs are easily thrown off by even slight changes in how a prompt is written: rephrase a question, and the model might give a very different answer. This 'prompt sensitivity' was seen as a major weakness. The core issue the paper addresses is whether this sensitivity is a real flaw in the LLMs themselves, or whether the methods we use to grade their answers are too strict and fail to recognize correct answers expressed in different ways.
What's the solution?
The researchers tested 7 LLMs (from the GPT and Gemini families) on 6 benchmarks, using 12 different prompt templates to phrase the same question. They compared traditional grading methods, which look for exact matches to expected answers, with a newer method in which another LLM acts as the judge (LLM-as-a-Judge). The traditional methods showed large swings in scores depending on the prompt wording. When an LLM judged the answers, however, the scores were much more consistent across prompts, and the models performed better overall.
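To see why rigid grading inflates apparent prompt sensitivity, consider a toy comparison. The sketch below contrasts verbatim answer matching with a lenient matcher that stands in for an LLM judge (the real paper uses an actual LLM as the judge; the function names and example strings here are illustrative, not from the paper):

```python
import re

def exact_match(response: str, gold: str) -> bool:
    # Rigid matching: only a verbatim (case-normalized) match counts.
    return response.strip().lower() == gold.strip().lower()

def lenient_match(response: str, gold: str) -> bool:
    # Stand-in for an LLM judge: credit the gold answer if it appears
    # as a token in the response after stripping punctuation and case.
    norm = lambda s: re.sub(r"[^a-z0-9 ]", " ", s.lower()).split()
    return gold.lower() in norm(response)

response = "The correct option is (B), Paris."
gold = "B"

print(exact_match(response, gold))    # rigid matching penalizes the phrasing
print(lenient_match(response, gold))  # semantic-style grading credits it
```

Under exact matching, a correct answer wrapped in extra words is scored as wrong, so any prompt template that nudges the model toward a wordier answer style looks like a performance drop, even though the underlying answer is unchanged.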
Why it matters?
This research suggests that LLMs might actually be more reliable and less sensitive to phrasing than we thought. The problem isn't necessarily that the models are bad at understanding, but that our current ways of evaluating them are too rigid and don't account for the fact that there's often more than one right way to say something. This means we might be underestimating the capabilities of these models and need to develop better evaluation techniques.
Abstract
Prompt sensitivity, referring to the phenomenon where paraphrasing leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks, on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
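The "reduction in performance variance" claim can be made concrete by measuring each model's score spread across prompt templates. The sketch below does this with fabricated accuracy numbers (the model names and values are purely illustrative; the paper's actual figures come from its benchmarks):

```python
from statistics import pstdev

# Hypothetical per-template accuracies for two models under four
# prompt templates. These numbers are invented for illustration.
strict = {                      # scored with rigid answer matching
    "model_a": [0.62, 0.48, 0.71, 0.55],
    "model_b": [0.58, 0.66, 0.49, 0.60],
}
judge = {                       # scored with LLM-as-a-Judge
    "model_a": [0.74, 0.72, 0.75, 0.73],
    "model_b": [0.69, 0.70, 0.68, 0.71],
}

def spread(scores):
    # Standard deviation of each model's accuracy across templates:
    # smaller spread = less apparent prompt sensitivity.
    return {m: round(pstdev(accs), 3) for m, accs in scores.items()}

print("strict matching:", spread(strict))
print("LLM-as-a-Judge: ", spread(judge))
```

If the judge-based spread is consistently smaller than the strict-matching spread, as the paper reports for its real data, then much of the apparent sensitivity was an artifact of the scoring method rather than of the model.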