
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen

2025-02-12


Summary

This paper presents a new way to test whether AI language models really understand language, rather than just recognizing memorized patterns. The researchers created a tool called C-BOD that rewords test questions to see whether AI models can still answer them correctly.

What's the problem?

Many AI language models score very high on public tests, but these scores can be misleading. The models may just be recognizing patterns in how questions are asked, rather than truly understanding their meaning. This makes it hard to know how well these models would perform in real-world situations where questions are phrased differently.

What's the solution?

The researchers developed C-BOD, which takes existing test questions and rephrases them in different ways without changing their meaning. They then used C-BOD to test 26 top AI language models on a benchmark called MMLU. This helped them see which models could still answer correctly when the questions were worded differently, and which ones struggled when the familiar patterns were changed.
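The comparison at the heart of this setup, scoring each model on the original questions and on their rephrased counterparts and then checking whether the gap is statistically significant, can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the function names are hypothetical, the rephrasing step and model calls are assumed to happen upstream, and since the abstract does not name the significance test, an exact McNemar test on per-question correctness is used here as one reasonable choice.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value from discordant counts:
    b = questions answered correctly originally but not when rephrased,
    c = questions answered correctly only when rephrased."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided exact binomial tail with p = 0.5 under the null
    # (rephrasing has no effect on correctness).
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def degradation_report(orig_correct, reph_correct):
    """Compare per-question correctness (lists of bools) on the
    original benchmark vs. its rephrased version."""
    assert len(orig_correct) == len(reph_correct)
    n = len(orig_correct)
    acc_orig = sum(orig_correct) / n
    acc_reph = sum(reph_correct) / n
    b = sum(o and not r for o, r in zip(orig_correct, reph_correct))
    c = sum(r and not o for o, r in zip(orig_correct, reph_correct))
    return {
        "accuracy_original": acc_orig,
        "accuracy_rephrased": acc_reph,
        "degradation": acc_orig - acc_reph,
        "p_value": mcnemar_exact_p(b, c),
    }
```

A model that relies on surface patterns shows a large `degradation` with a small `p_value`; a robust model shows little change, as the paper reports for the Llama family.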

Why it matters?

This matters because it shows we need to be more careful about how we judge AI language models. Just because a model scores high on a test doesn't mean it truly understands language. By using tools like C-BOD, we can create better, more reliable AI models that actually understand what they're reading or writing, rather than just memorizing patterns. This could lead to AI assistants and tools that are more flexible and useful in real-world situations where language can be unpredictable.

Abstract

Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings, indicating that both cases may overrely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.