Pretraining with hierarchical memories: separating long-tail and common knowledge
Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel
2025-10-06
Summary
This paper explores a new way to build capable language models without making them enormous, which is the current standard practice.
What's the problem?
Current large language models are effective because they have a massive number of parameters in which they store vast amounts of world knowledge. However, this approach isn't ideal: only a fraction of that stored knowledge is used for any single prompt, and these giant models are too big to run efficiently on smaller devices like phones or laptops. It's like carrying an entire encyclopedia around just to look up one fact.
What's the solution?
The researchers developed a system in which a small language model works alongside a large 'memory bank' of parameters. Instead of storing *all* knowledge inside the model itself, the system fetches a small, context-dependent block from this memory bank when needed. They also designed a pretraining strategy that pushes specialized, long-tail knowledge into the memory bank, while the small model focuses on common knowledge and general reasoning. Think of it as a quick-thinking friend (the small model) who can look up details in a detailed reference book (the memory bank).
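As a rough illustration of the idea (not the paper's actual implementation), the sketch below widens a feed-forward layer with one context-dependent block fetched from a parametric memory bank. All names, sizes, and the centroid-matching rule are invented for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 16, 32           # model width, base FFN hidden size (made-up sizes)
M_BLOCKS, H_MEM = 8, 4  # memory bank: number of blocks, hidden units per block

# Base (small) model FFN parameters: always active, captures common knowledge.
W_in, W_out = rng.normal(size=(D, H)), rng.normal(size=(H, D))

# Memory bank: each block is an extra slice of FFN parameters plus a
# centroid used to match the block against the prompt's context embedding.
mem_in = rng.normal(size=(M_BLOCKS, D, H_MEM))
mem_out = rng.normal(size=(M_BLOCKS, H_MEM, D))
centroids = rng.normal(size=(M_BLOCKS, D))

def fetch_block(context):
    """Pick the memory block whose centroid best matches the context."""
    return int(np.argmax(centroids @ context))

def ffn(x, context):
    """Feed-forward layer whose hidden size is widened by the fetched block."""
    b = fetch_block(context)
    # Concatenate base and memory parameters along the hidden dimension,
    # so only one small block of the bank is resident at inference time.
    Win = np.concatenate([W_in, mem_in[b]], axis=1)    # (D, H + H_MEM)
    Wout = np.concatenate([W_out, mem_out[b]], axis=0)  # (H + H_MEM, D)
    h = np.maximum(Win.T @ x, 0.0)  # ReLU
    return Wout.T @ h

x = rng.normal(size=D)
y = ffn(x, context=x)
print(y.shape)  # (16,)
```

The point of the sketch is the memory arithmetic: the full bank holds `M_BLOCKS * H_MEM` extra hidden units, but each prompt only pays for `H_MEM` of them, so bank capacity can grow without growing inference-time cost.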
Why it matters?
This research is important because it shows we can achieve similar performance to very large models with much smaller ones, making them more practical for use on a wider range of devices. It opens the door to running sophisticated AI applications on phones, embedded systems, and other devices with limited resources, and it's a step towards more efficient and accessible AI.
Abstract
The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.