GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent
Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev
2026-03-18
Summary
This paper explores a way to make large language models, like those used for chatbots, better at remembering and using information from very long conversations or documents without using a huge amount of computer memory.
What's the problem?
Large language models need to consider a lot of past information to give good answers, but storing all of that information, every word from the conversation so far, takes up a lot of memory. This is especially true for Transformers, the architecture behind most of these models, which keep a 'cache' of past information for every layer of the model. It's like trying to remember everything ever said to you, word for word, which quickly becomes impractical as the conversation grows.
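To get a feel for the scale, here is a rough back-of-the-envelope sketch of how large such a cache can get. The formula is the standard count of key and value tensors per layer; the model configuration (32 layers, 32 heads, head size 128, 16-bit values) is a hypothetical example chosen for illustration, not a figure taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size for one sequence.

    Each layer stores one key and one value vector (hence the factor 2)
    of size n_kv_heads * head_dim for every token seen so far.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, assumed only for this example.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB for this configuration
```

The cache grows linearly with the number of tokens, so doubling the context length doubles the memory needed for it.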
What's the solution?
The researchers developed a method called 'GradMem', which is a smarter way to store information. Instead of just saving everything, GradMem compresses the information into a smaller, more manageable form. It does this by slightly adjusting a small set of 'memory tokens' using gradient descent, the same process a model uses to learn, but without changing the main model itself. It's like taking notes on a long lecture instead of trying to memorize every sentence. The adjustment is guided by how well the original information can be reconstructed from the memory, so errors are corrected step by step, making the memory more reliable.
Why does it matter?
This is important because it allows language models to handle much longer inputs and conversations without needing massive amounts of memory. This means these models could become more powerful and useful in real-world applications like complex question answering, summarizing long documents, or having more in-depth conversations, all while being more efficient and potentially cheaper to run.
Abstract
Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
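The write operation described in the abstract can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions, not the authors' implementation: the frozen language model is replaced by a fixed random linear decoder `W`, the context by a vector of numbers, and the reconstruction loss by a squared error. All function names are hypothetical. What it does preserve is the core idea: only the memory vector is updated by gradient descent, while the "model" stays frozen.

```python
import random


def make_frozen_decoder(n_out, mem_dim, seed=0):
    """Stand-in for the frozen model: a fixed random linear map."""
    rng = random.Random(seed)
    scale = n_out ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(mem_dim)] for _ in range(n_out)]


def decode(W, m):
    """Reconstruct the context from memory m using the frozen decoder."""
    return [sum(w * mj for w, mj in zip(row, m)) for row in W]


def reconstruction_loss(W, m, context):
    return 0.5 * sum((p - c) ** 2 for p, c in zip(decode(W, m), context))


def write_memory(W, context, steps):
    """Write the context into memory with a few gradient steps; W is never modified."""
    mem_dim = len(W[0])
    m = [0.0] * mem_dim
    # Step size 1 / trace(W^T W) is at most 1 / (largest eigenvalue of W^T W),
    # which is small enough for every step to reduce the loss.
    lr = 1.0 / sum(w * w for row in W for w in row)
    losses = [reconstruction_loss(W, m, context)]
    for _ in range(steps):
        residual = [p - c for p, c in zip(decode(W, m), context)]
        # Gradient of the squared error with respect to the memory only.
        grad = [sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(mem_dim)]
        m = [mj - lr * g for mj, g in zip(m, grad)]
        losses.append(reconstruction_loss(W, m, context))
    return m, losses


rng = random.Random(1)
W = make_frozen_decoder(n_out=32, mem_dim=8)
context = [rng.gauss(0.0, 1.0) for _ in range(32)]
memory, losses = write_memory(W, context, steps=20)
print(f"loss before writing: {losses[0]:.3f}, after 20 steps: {losses[-1]:.3f}")
```

In this toy, each extra gradient step lowers the reconstruction error, which mirrors the iterative error correction that the abstract contrasts with forward-only writers. A 32-number context is squeezed into 8 memory values, so the reconstruction is lossy by construction.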