Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona
2026-02-19
Summary
This research investigates how well large language models (LLMs) actually 'know' things, and why they sometimes get facts wrong. It moves beyond simply checking whether an answer is right or wrong to understand *how* the model fails: is the information missing from its training altogether, or does the model have it but struggle to retrieve it when asked?
What's the problem?
Current methods for evaluating LLMs treat all factual errors the same way, which makes it hard to pinpoint the root cause. Do models fail because they were never taught a piece of information (like an empty shelf), or because they *have* learned it but cannot access it when asked (like losing the key to a storage unit)? Without that distinction, it is hard to improve models effectively, because we don't know where to focus: adding more training data, or improving how the model uses the knowledge it already has.
What's the solution?
The researchers built a new benchmark called WikiProfile, constructed automatically by prompting an LLM grounded in web search. Rather than only checking whether a model answers a question correctly, the evaluation profiles each fact: is it 'encoded' (stored in the model's parameters) or not? If encoded, can the model recall it directly, recall it only after 'thinking' (extra inference-time computation), or not recall it at all? They tested 13 different LLMs, including frontier models like GPT-5 and Gemini-3, collecting over 4 million responses. A simplified sketch of this classification follows.
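To make the taxonomy concrete, here is a minimal Python sketch of the fact-level profiling. The `FactProfile` labels and the `profile_fact` helper are hypothetical simplifications: the paper determines encoding and recall behaviorally from many model responses per fact, not from single yes/no checks.

```python
from enum import Enum


class FactProfile(Enum):
    """Illustrative labels mirroring the paper's fact-level taxonomy."""
    NOT_ENCODED = "not encoded"                    # empty shelf: the fact is absent
    ENCODED_NOT_RECALLED = "encoded, no recall"    # lost key: stored but inaccessible
    DIRECT_RECALL = "direct recall"                # recalled without extra computation
    RECALL_WITH_THINKING = "recall with thinking"  # recalled only with inference-time thinking


def profile_fact(is_encoded: bool,
                 recalled_directly: bool,
                 recalled_with_thinking: bool) -> FactProfile:
    """Map probe outcomes for a single fact to one profile label.

    The three booleans are assumed, simplified inputs; the paper derives
    them from many model responses per fact rather than one check.
    """
    if not is_encoded:
        return FactProfile.NOT_ENCODED
    if recalled_directly:
        return FactProfile.DIRECT_RECALL
    if recalled_with_thinking:
        return FactProfile.RECALL_WITH_THINKING
    return FactProfile.ENCODED_NOT_RECALLED


# Example: a fact the model encodes but can only surface when it thinks.
print(profile_fact(is_encoded=True,
                   recalled_directly=False,
                   recalled_with_thinking=True))
# -> FactProfile.RECALL_WITH_THINKING
```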
Why it matters?
The study found that the most advanced models already 'know' a huge amount: GPT-5 and Gemini-3 encode 95-98% of the facts on the benchmark. The main problem is not missing knowledge but difficulty *accessing* it. This suggests that simply making models bigger and training them on more data may not be the best path forward. Instead, future improvements should focus on helping models better use the information they already possess, for example by allowing more inference-time 'thinking', which the study shows recovers a substantial fraction of recall failures.
Abstract
Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
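As an illustration of how per-fact profiles could be rolled up into the headline numbers (encoding rate, and recall conditioned on encoding), here is a short aggregation sketch. It reuses the `FactProfile` enum from the sketch above; the function name and metric names are illustrative, not the paper's evaluation code.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(profiles: Iterable[FactProfile]) -> Dict[str, float]:
    """Aggregate per-fact profiles into two illustrative metrics.

    encoding_rate: fraction of facts encoded at all (shelves that are stocked).
    recall_rate_given_encoded: fraction of encoded facts the model can access,
    either directly or with thinking (keys that are not lost).
    """
    counts = Counter(profiles)
    total = sum(counts.values())
    encoded = total - counts[FactProfile.NOT_ENCODED]
    recalled = (counts[FactProfile.DIRECT_RECALL]
                + counts[FactProfile.RECALL_WITH_THINKING])
    return {
        "encoding_rate": encoded / total if total else 0.0,
        "recall_rate_given_encoded": recalled / encoded if encoded else 0.0,
    }
```

Under this framing, the paper's finding is that `encoding_rate` is close to saturation for frontier models, while `recall_rate_given_encoded` lags behind, which is why recall, not encoding, is described as the bottleneck.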