MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee
2026-03-28
Summary
This paper introduces a new way to improve how computer vision models 'see' images, focusing on using different levels of detail to get a more complete understanding.
What's the problem?
Current computer vision models, even state-of-the-art ones, typically process an image at only a single, fixed size at inference time, even though they may have been trained on varied sizes. This is a problem because human vision relies on multiple levels of detail: we see both the big picture and the fine details, and both matter. By ignoring other resolutions, the model misses complementary information, such as recognizing a general object category from a coarse view versus identifying fine-grained features from a detailed one.
What's the solution?
The researchers developed a method called Multi-Resolution Fusion, or MuRF. It works by feeding the same image into the vision model at multiple resolutions (different sizes). The model then processes each version and combines the information it gets from all of them. Importantly, this doesn't require any further training of the original model; it's an add-on that improves performance right away.
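The idea above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the `encode` function is a hypothetical stand-in for a frozen vision foundation model (it just average-pools patches into a token grid), the resizing uses nearest-neighbor sampling, and fusion is a simple average of the per-scale feature grids. The scale values and grid sizes are assumptions chosen for the example.

```python
import numpy as np

def encode(image, patch=16):
    """Hypothetical stand-in for a frozen VFM: average-pools
    non-overlapping patches into a grid of 'feature' vectors."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    crop = image[:gh * patch, :gw * patch]
    return crop.reshape(gh, patch, gw, patch, c).mean(axis=(1, 3))

def resize_nn(image, size):
    """Nearest-neighbor resize to size x size (a real pipeline
    would likely use bilinear interpolation)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def murf(image, scales=(224, 448), out_grid=28):
    """Multi-resolution fusion sketch: run the frozen encoder on the
    image at several resolutions, resize every resulting token grid
    to a common size, and fuse by averaging. No training involved."""
    per_scale = []
    for s in scales:
        feats = encode(resize_nn(image, s))
        per_scale.append(resize_nn(feats, out_grid))
    return np.mean(per_scale, axis=0)

img = np.random.rand(640, 480, 3).astype(np.float32)
rep = murf(img)
print(rep.shape)  # (28, 28, 3): one fused feature grid
```

The key design point this sketch captures is that the encoder itself is never modified; the multi-scale behavior comes entirely from how inputs are presented and how outputs are combined.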
Why does it matter?
This is important because MuRF is a simple way to significantly boost the performance of existing computer vision models without needing to retrain them. It works with many different types of models, making it a broadly applicable improvement for a wide range of tasks like image recognition and object detection, ultimately leading to more accurate and reliable computer vision systems.
Abstract
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families, primarily DINOv2, while also demonstrating successful generalization to contrastive models like SigLIP2.