
Fast-weight Product Key Memory

Tianyu Zhao, Llion Jones

2026-01-05

Summary

This paper introduces a new way for language models to remember information from long texts, improving their ability to understand and generate text that depends on distant context.

What's the problem?

Current language models struggle with remembering details from very long pieces of text. Traditional methods either use a lot of computing power and memory to store everything, or they have limited memory and quickly forget earlier parts of the text. It's a balancing act between being able to store a lot of information and being efficient in how they process it.

What's the solution?

The researchers developed a system called Fast-weight Product Key Memory, or FwPKM. Think of it like a dynamic notebook that the model can quickly write in and read from while processing text. Unlike older systems, FwPKM doesn't just store information statically; it updates its 'notes' as it reads, allowing it to learn and remember new things on the fly. It does this by making small adjustments to its internal settings based on the text it's currently looking at.
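The "small adjustments to its internal settings" can be pictured as gradient steps on a key-value memory. The following is a minimal sketch, not the paper's implementation: it assumes a simple linear memory matrix and a squared-error reconstruction loss, updated one chunk at a time the way the summary describes.

```python
import numpy as np

class FastWeightMemory:
    """Toy fast-weight memory (hypothetical sketch, not the FwPKM code).

    A linear map M stores key -> value associations. After each chunk we
    take one local gradient-descent step on the reconstruction loss
    ||K M^T - V||^2, so recently seen pairs can be retrieved later.
    """

    def __init__(self, d_key, d_val, lr=1.0):
        self.M = np.zeros((d_val, d_key))  # fast weights, updated at inference too
        self.lr = lr

    def read(self, keys):
        # keys: (n, d_key) -> retrieved values: (n, d_val)
        return keys @ self.M.T

    def write_chunk(self, keys, values):
        # One gradient step on 0.5 * ||K M^T - V||^2 with respect to M
        pred = keys @ self.M.T
        grad = (pred - values).T @ keys / len(keys)
        self.M -= self.lr * grad

rng = np.random.default_rng(0)
k = rng.standard_normal((1, 8))
k /= np.linalg.norm(k)               # unit-norm key
v = rng.standard_normal((1, 4))

mem = FastWeightMemory(d_key=8, d_val=4, lr=1.0)
mem.write_chunk(k, v)                # "write in the notebook" while reading
recovered = mem.read(k)              # later lookup returns the stored value
```

With a unit-norm key and learning rate 1, a single step stores the pair exactly; in general, repeated chunk-level steps make retrieval progressively more accurate, which is the "learning on the fly" behavior the paragraph above describes.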

Why it matters?

This is important because it allows language models to handle much longer texts without losing track of important details. The experiments showed that FwPKM can effectively recall information on texts 32 times longer than those it was trained on (128K tokens versus 4K), which is a big step towards building models that can truly understand and work with complex, lengthy content.

Abstract

Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
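The "product key" part of PKM makes the memory lookup cheap: the query is split in half, each half is scored against a small sub-key table, and the top candidates are combined, so n&#178; composite memory slots are searched at roughly O(n) cost. Below is a hedged sketch of that retrieval trick (sub-key tables, sizes, and the slot-indexing convention are illustrative assumptions, not the paper's exact design).

```python
import numpy as np

def product_key_topk(query, subkeys1, subkeys2, k=4):
    """Product-key lookup sketch: search n*n composite keys via two
    size-n sub-key tables. Composite key (i, j) scores s1[i] + s2[j]."""
    d = len(query) // 2
    q1, q2 = query[:d], query[d:]
    s1 = subkeys1 @ q1                     # (n,) scores for the first half
    s2 = subkeys2 @ q2                     # (n,) scores for the second half
    top1 = np.argsort(s1)[-k:]             # best k sub-keys per half suffice
    top2 = np.argsort(s2)[-k:]             # for an exact global top-k
    combined = s1[top1][:, None] + s2[top2][None, :]
    flat = np.argsort(combined.ravel())[-k:][::-1]
    i, j = np.unravel_index(flat, combined.shape)
    n = len(subkeys2)
    return top1[i] * n + top2[j]           # flat slot index of key (i, j)

rng = np.random.default_rng(1)
sub1 = rng.standard_normal((16, 8))        # 16 sub-keys per table ->
sub2 = rng.standard_normal((16, 8))        # 256 addressable memory slots
q = rng.standard_normal(16)
slots = product_key_topk(q, sub1, sub2, k=4)
```

Only the selected slots' values are then read, which is what keeps the memory sparse; FwPKM's contribution is making the values behind those slots fast weights that change during reading, rather than fixed trained parameters.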