NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord
2025-11-27
Summary
This paper introduces a new method called Neighborhood Attention Filtering (NAF) for restoring spatial detail to the internal representations that Vision Foundation Models (VFMs) compute from images. VFMs are powerful AI systems that 'see' and understand images, but internally they work with spatially downsampled feature maps, which makes tasks requiring precise, pixel-level detail difficult.
What's the problem?
Vision Foundation Models begin by simplifying images: they break an image into patches and produce a much smaller grid of feature vectors. For tasks that require fine detail, these features must be brought back to high resolution, and simply enlarging the small grid doesn't work well. Existing methods face a trade-off: classical filters are fast and work with any model but use fixed, content-blind weights, while modern learned upsamplers are more accurate but must be specifically trained for *each* different Vision Foundation Model, which is time-consuming and inflexible.
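The "fast but basic" side of this trade-off can be illustrated with bilinear interpolation of a feature map in PyTorch. This is a generic sketch, not code from the NAF repository, and the tensor sizes are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Toy example: a VFM (e.g. a ViT with 14x14 patches) turns a 224x224 image
# into a small grid of feature vectors. Sizes here are illustrative.
features = torch.randn(1, 384, 16, 16)   # (batch, channels, height, width)

# The classical route: bilinear interpolation back to pixel resolution.
# Every output location is a fixed blend of its nearest low-res neighbors,
# regardless of image content, so edges and fine structures come out blurry.
upsampled = F.interpolate(features, size=(224, 224),
                          mode="bilinear", align_corners=False)

print(upsampled.shape)  # torch.Size([1, 384, 224, 224])
```

The weights of this filter depend only on pixel positions, never on what the image actually contains, which is exactly the limitation content-adaptive upsamplers aim to fix.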
What's the solution?
NAF solves this by using the original, high-resolution image as a guide. For each output location, it learns to weigh information from nearby low-resolution features, taking both their content and their relative positions into account, to reconstruct a detailed, high-resolution feature map. Crucially, it doesn't need to be retrained for different Vision Foundation Models: it works 'zero-shot', meaning a single trained NAF can upsample features from any VFM without additional learning. Two ingredients make this possible: Cross-Scale Neighborhood Attention, which computes content-adaptive weights over local neighborhoods, and Rotary Position Embeddings (RoPE), which encode relative positions.
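To make the idea concrete, here is a minimal, illustrative sketch of image-guided neighborhood attention upsampling in PyTorch. This is not the authors' implementation: the learned projections, multi-head attention, and RoPE of the real method are omitted, and the `neighborhood_upsample` helper and all tensor shapes are invented for illustration.

```python
import torch
import torch.nn.functional as F

def neighborhood_upsample(lowres_feats, guide, k=3):
    """Illustrative cross-scale neighborhood attention (not the official NAF code).

    Each high-resolution position attends to a k x k neighborhood of the
    low-resolution feature map; attention weights come from the high-res
    guidance image, so the filter adapts to image content instead of using
    fixed bilinear weights.
    """
    B, C, h, w = lowres_feats.shape
    _, Cg, H, W = guide.shape

    # Queries: one per high-res location, taken from the guidance image.
    q = guide.permute(0, 2, 3, 1).reshape(B, H * W, Cg)           # (B, HW, Cg)

    # Keys: the guidance image at feature resolution; gather each k x k
    # low-res neighborhood with unfold. Values: the low-res features.
    guide_lr = F.interpolate(guide, size=(h, w), mode="bilinear",
                             align_corners=False)
    k_patches = F.unfold(guide_lr, k, padding=k // 2)             # (B, Cg*k*k, h*w)
    k_patches = k_patches.reshape(B, Cg, k * k, h * w)
    v_patches = F.unfold(lowres_feats, k, padding=k // 2)         # (B, C*k*k, h*w)
    v_patches = v_patches.reshape(B, C, k * k, h * w)

    # Map each high-res location to its nearest low-res cell.
    ys = torch.arange(H) * h // H
    xs = torch.arange(W) * w // W
    idx = (ys[:, None] * w + xs[None, :]).reshape(-1)             # (H*W,)
    k_sel = k_patches[..., idx]                                   # (B, Cg, k*k, H*W)
    v_sel = v_patches[..., idx]                                   # (B, C, k*k, H*W)

    # Content-adaptive weights: query vs. neighborhood keys, then softmax.
    attn = torch.einsum("bnc,bckn->bkn", q, k_sel) / (Cg ** 0.5)
    attn = attn.softmax(dim=1)                                    # (B, k*k, H*W)

    # Weighted sum of the neighborhood's low-res features.
    out = torch.einsum("bkn,bckn->bcn", attn, v_sel)
    return out.reshape(B, C, H, W)

feats = torch.randn(1, 64, 16, 16)   # low-res VFM features
image = torch.randn(1, 3, 64, 64)    # high-res guidance image
print(neighborhood_upsample(feats, image).shape)  # torch.Size([1, 64, 64, 64])
```

Note how the function never looks inside the VFM: it only needs the feature map and the image, which is what lets this family of filters apply zero-shot to features from any model.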
Why does it matter?
This work is significant because it's the first VFM-agnostic method to upsample features from *any* Vision Foundation Model better than upsamplers trained specifically for a single model, which makes it far more practical and efficient. It is also fast enough for real-time use, reconstructing intermediate-resolution maps at 18 FPS, and it isn't limited to feature upsampling: it also performs well on image restoration, showing its broad usefulness in image processing.
Abstract
Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.