Towards flexible perception with visual memory
Robert Geirhos, Priyank Jaini, Austin Stone, Sourabh Medapati, Xi Yi, George Toderici, Abhijit Ogale, Jonathon Shlens
2024-08-16

Summary
This paper introduces a way to make image classifiers more flexible: instead of storing all visual knowledge in a network's weights, it pairs a neural network with an explicit, editable visual memory.
What's the problem?
Training neural networks to recognize images is often a one-time process, similar to carving information into stone. Once trained, it's hard to change or update what the network knows without retraining it completely. This limits the ability of these models to adapt to new information or correct mistakes.
What's the solution?
The authors propose splitting image classification into two steps: a pre-trained neural network turns each image into an embedding, and classification is done by searching a database of stored embeddings for the nearest neighbors. Because the knowledge lives in the database rather than in the network's weights, data can be added or removed at any time without retraining, from individual samples up to entire classes. The retrieved neighbors also make each decision interpretable, giving users a handle to inspect and control the model's behavior.
Why it matters?
This research is significant because it offers a more adaptable way for neural networks to learn from visual data. By creating a system that can easily update its knowledge, we can improve the performance of models in various applications, such as image recognition and computer vision, making them more useful in real-world situations.
Abstract
Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is nearly impossible, since all information is distributed across the network's weights. We here explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pre-trained embedding) and search (via fast nearest neighbor retrieval from a knowledge database), we build a simple and flexible visual memory that has the following key capabilities: (1.) The ability to flexibly add data across scales: from individual samples all the way to entire classes and billion-scale data; (2.) The ability to remove data through unlearning and memory pruning; (3.) An interpretable decision-mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models -- beyond carving it in "stone" weights.
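The decomposition described in the abstract (embedding plus nearest-neighbor search over an editable database) can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the class name `VisualMemory`, the exact-search `classify` method, and the label-based `remove` are assumptions for the sketch, whereas the paper uses a pre-trained image encoder and fast approximate retrieval at billion scale.

```python
import numpy as np

class VisualMemory:
    """Toy explicit visual memory: labels are assigned by a
    k-nearest-neighbor vote over stored embedding vectors."""

    def __init__(self):
        self.embeddings = []  # one vector per stored sample
        self.labels = []      # label for each stored sample

    def add(self, embedding, label):
        # Adding knowledge is just appending to the database;
        # no retraining is needed.
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.labels.append(label)

    def remove(self, label):
        # "Unlearning" a class: drop every sample with that label.
        keep = [i for i, lab in enumerate(self.labels) if lab != label]
        self.embeddings = [self.embeddings[i] for i in keep]
        self.labels = [self.labels[i] for i in keep]

    def classify(self, embedding, k=3):
        # Majority vote among the k most cosine-similar neighbors.
        # The retrieved neighbors double as an explanation of the decision.
        q = np.asarray(embedding, dtype=float)
        sims = [e @ q / (np.linalg.norm(e) * np.linalg.norm(q))
                for e in self.embeddings]
        top = sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
        votes = {}
        for i in top:
            votes[self.labels[i]] = votes.get(self.labels[i], 0) + 1
        return max(votes, key=votes.get)

memory = VisualMemory()
memory.add([1.0, 0.0], "cat")
memory.add([0.9, 0.1], "cat")
memory.add([0.0, 1.0], "dog")
print(memory.classify([0.95, 0.05]))  # prints "cat"
memory.remove("cat")                  # unlearn the class
print(memory.classify([0.95, 0.05]))  # prints "dog"
```

The design choice is the whole point of the paper: because inference is retrieval rather than a forward pass through fixed weights, adding, removing, or inspecting knowledge reduces to ordinary database operations.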