LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacs, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
2025-03-07
Summary
This paper introduces LINGOLY-TOO, a new way to test how well AI language models can truly reason rather than just memorise information.
What's the problem?
It's hard to tell whether AI language models are actually reasoning or just recognising patterns they've seen before. This makes it easy to overestimate how capable these models really are.
What's the solution?
The researchers created LINGOLY-TOO, a benchmark of linguistic puzzles whose writing systems have been systematically scrambled, making it harder for AI to rely on memorised information. They tested top AI models on these puzzles to see how well they could solve problems they hadn't seen before.
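To make the scrambling idea concrete, here is a minimal sketch of orthographic obfuscation, assuming a naive one-to-one character substitution over the whole alphabet. The paper's orthographic templates are hand-crafted per language, so the function names (make_orthography_map, obfuscate), the seed-based shuffle, and the example puzzle below are illustrative assumptions, not the authors' implementation.

```python
import random

def make_orthography_map(alphabet: str, seed: int) -> dict:
    """Build a random one-to-one substitution over the alphabet.

    Using a single consistent mapping is what keeps the puzzle's
    internal structure intact while changing its surface form.
    """
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))

def obfuscate(text: str, mapping: dict) -> str:
    """Apply the substitution to every character it covers,
    leaving spaces and punctuation unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)

# The same mapping is applied to both the language data and the
# question, so the grammatical pattern a solver must infer survives.
mapping = make_orthography_map("abcdefghijklmnopqrstuvwxyz", seed=42)
puzzle = "kitabu means book; vitabu means books"
print(obfuscate(puzzle, mapping))
```

Because the same mapping is applied everywhere, a morphological pattern (the ki-/vi- alternation in this toy example) is preserved under the scrambling, which is what allows obfuscation to keep the reasoning steps intact while making the exact strings unlikely to appear in training data.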
Why does it matter?
This matters because it helps us understand the true capabilities of AI language models. By showing that even advanced AI struggles with these puzzles, it reveals that current models may not be as good at reasoning as we thought. This knowledge can guide future AI development and help set more realistic expectations for what AI can do.
Abstract
Effective evaluation of the reasoning capabilities of large language models (LLMs) is susceptible to overestimation due to data exposure of evaluation benchmarks. We introduce a framework for producing linguistic reasoning problems that reduces the effect of memorisation in model performance estimates and apply this framework to develop LINGOLY-TOO, a challenging evaluation benchmark for linguistic reasoning. By developing orthographic templates, we dynamically obfuscate the writing systems of real languages to generate numerous question variations. These variations preserve the reasoning steps required for each solution while reducing the likelihood of specific problem instances appearing in model training data. Our experiments demonstrate that frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with advanced reasoning. Our analysis also shows that LLMs exhibit noticeable variance in accuracy across permutations of the same problem, and on average perform better on questions appearing in their original orthography. Our findings highlight the opaque nature of response generation in LLMs and provide evidence that prior data exposure contributes to overestimating the reasoning capabilities of frontier models.
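As a rough illustration of the variance analysis the abstract describes, the sketch below (reusing make_orthography_map and obfuscate from the earlier example) scores a model across several obfuscated permutations of a single problem. The model_answer_fn callable and exact-match scoring are assumptions for illustration, not the paper's evaluation pipeline.

```python
import statistics

def accuracy_across_permutations(model_answer_fn, puzzle, gold, seeds):
    """Score a model on several obfuscated variants of the same puzzle.

    Large spread across variants suggests sensitivity to surface form
    (e.g. memorised orthography) rather than robust reasoning.
    """
    scores = []
    for seed in seeds:
        mapping = make_orthography_map("abcdefghijklmnopqrstuvwxyz", seed)
        variant = obfuscate(puzzle, mapping)
        # Obfuscate the gold answer with the same mapping so that
        # exact-match comparison stays fair under the new orthography.
        predicted = model_answer_fn(variant)
        scores.append(1.0 if predicted.strip() == obfuscate(gold, mapping) else 0.0)
    return statistics.mean(scores), statistics.pstdev(scores)
```

Comparing the mean and spread of these scores against accuracy on the original orthography is one simple way to surface the memorisation gap the paper reports.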