Distilling Feedback into Memory-as-a-Tool
Víctor Gallego
2026-01-12
Summary
This paper introduces a new way to make large language models (LLMs) better at tasks requiring reasoning, like giving feedback on student work, without making each response super slow and expensive.
What's the problem?
LLMs are great, but when you need them to *think* through a problem step-by-step, like explaining *why* an answer is wrong and how to fix it, it takes a lot of computing power and time. Existing methods that improve reasoning at the moment of use, called 'test-time refinement,' are slow and costly because they re-run the same critique-and-revise reasoning from scratch for every new question, and then throw the results away.
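To make the cost problem concrete, here is a minimal sketch of a generic test-time refinement loop. It is not the paper's pipeline: the `llm` function is a stub standing in for a real model call, and the function names are hypothetical. The point is that the critique produced inside the loop is transient, so the same work is repeated on every new question.

```python
# Hypothetical sketch of a test-time refinement loop (generate -> critique
# -> revise). `llm` is a stub for a real model call; in practice each call
# costs time and money, and nothing here is reused across questions.
def llm(prompt: str) -> str:
    # Stand-in for a real model API call.
    return f"[model output for: {prompt[:40]}]"

def refine(question: str, rounds: int = 2) -> str:
    answer = llm(f"Answer: {question}")
    for _ in range(rounds):
        # The critique is used once and then discarded -- this is the
        # transient reasoning that the paper proposes to save instead.
        critique = llm(f"Critique this answer: {answer}")
        answer = llm(f"Revise using the critique: {critique}")
    return answer

# Each call pays the full critique/revise cost again from scratch:
result = refine("Why is this student's proof incomplete?")
print(result)
```

With two refinement rounds, every question costs five model calls instead of one; that multiplier is what the paper's memory mechanism amortizes away.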
What's the solution?
The researchers created a system where the LLM 'learns' from its own mistakes and saves helpful 'critiques' as guidelines in a kind of memory bank. When a new problem comes along, the LLM first checks this memory for similar situations and uses those past insights to quickly improve its answer. It's like studying old tests to prepare for a new one, and the LLM uses 'tool calls' to manage this memory efficiently. They tested this on a new dataset specifically designed for evaluating how well LLMs can give feedback based on a set of rules, called a rubric.
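The save-then-retrieve loop described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the class and method names (`GuidelineMemory`, `save_guideline`, `retrieve_guidelines`) are hypothetical, and retrieval here is a naive keyword match where a real system would likely embed and rank entries. It does keep the paper's file-based flavor by persisting guidelines to a JSON file that the agent's tool calls read and write.

```python
# Minimal sketch of memory-as-a-tool: critiques distilled into guidelines,
# stored in a file, and retrieved before answering new questions.
# All names are hypothetical; retrieval is deliberately simplistic.
import json
from pathlib import Path

class GuidelineMemory:
    """File-backed store of distilled critiques, keyed by topic."""

    def __init__(self, path: str = "demo_guidelines.json"):
        self.path = Path(path)
        self.store: dict[str, list[str]] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def save_guideline(self, topic: str, guideline: str) -> None:
        # Tool call the agent makes once a critique has proven useful.
        self.store.setdefault(topic, []).append(guideline)
        self.path.write_text(json.dumps(self.store, indent=2))

    def retrieve_guidelines(self, query: str) -> list[str]:
        # Naive keyword match; a real system might use embeddings.
        return [
            g
            for topic, guidelines in self.store.items()
            if topic in query.lower()
            for g in guidelines
        ]

# Start from an empty store for this demo run.
Path("demo_guidelines.json").unlink(missing_ok=True)
memory = GuidelineMemory()

# Distill earlier critiques into reusable guidelines:
memory.save_guideline("thesis", "Check that the thesis is arguable, not a plain fact.")
memory.save_guideline("citations", "Flag quotes that lack an in-text citation.")

# On a new question, consult memory first instead of re-deriving the critique:
hits = memory.retrieve_guidelines("Give feedback on this essay's thesis paragraph.")
print(hits)
```

The one-time cost of writing a guideline is paid back every time a later question retrieves it, which is the amortization the abstract refers to.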
Why does it matter?
This work is important because it allows LLMs to perform complex reasoning tasks much faster and cheaper than before. This means we can use these powerful models in more real-world applications, like automated tutoring or detailed report generation, without breaking the bank or waiting forever for a response. It shows a promising path towards making LLMs more practical and accessible.
Abstract
We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.