Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Minseok Seo, Mark Hamilton, Changick Kim

2025-11-25

Summary

This paper introduces a new technique called Upsample Anything, which restores fine spatial detail to the low-resolution outputs of large AI vision models without needing to retrain those models.

What's the problem?

Large, powerful AI models for vision tasks often reduce the resolution of images internally to make processing faster. While this works well for general understanding, it makes it difficult to perform tasks that need precise, pixel-level detail, like accurately outlining objects in a picture or creating detailed depth maps. Existing methods to restore this detail either require retraining the AI model with new data, which is time-consuming and doesn't work well on different types of images, or they are computationally expensive and slow.
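To make the resolution loss concrete, here is the arithmetic for a typical Vision Transformer (the 224-pixel image size and 16-pixel patch size are standard ViT defaults, used here for illustration):

```python
# A ViT with a 16x16 patch size turns a 224x224 image into a 14x14 grid of
# tokens, so pixel-level tasks lose a factor of 16 in resolution per axis.
image_size = 224
patch_size = 16
tokens_per_side = image_size // patch_size  # -> 14
downsampling_factor = image_size // tokens_per_side  # -> 16
```

Any task that needs per-pixel answers must somehow recover that lost factor of 16, which is exactly what feature upsampling does.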

What's the solution?

Upsample Anything solves this by optimizing a small upsampling operator for each image at test time, after the AI model has produced its low-resolution features but before the final output. It learns a special 'kernel' – think of it as a smart, edge-aware filter – that weights nearby pixels by both their spatial distance and their color or intensity similarity. This kernel is fit anew for each image, but fitting it is fast and requires no training data. Conceptually, it bridges two existing techniques: Gaussian Splatting and Joint Bilateral Upsampling.
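The paper's learned anisotropic kernel generalizes classic Joint Bilateral Upsampling. A minimal NumPy sketch of plain JBU (isotropic Gaussians, illustrative parameter names, not the paper's implementation) shows the core idea of combining a spatial weight with a guide-image range weight:

```python
import numpy as np

def joint_bilateral_upsample(feat_lr, guide_hr, sigma_spatial=1.0,
                             sigma_range=0.1, radius=2):
    """Upsample a low-res feature map (h, w, C) to the resolution of a
    high-res grayscale guide image (H, W). Each output pixel is a weighted
    average of nearby low-res features; weights combine spatial distance
    with guide-intensity similarity, so edges in the guide are preserved."""
    H, W = guide_hr.shape
    h, w, C = feat_lr.shape
    scale_y, scale_x = h / H, w / W
    out = np.zeros((H, W, C))
    for y in range(H):
        for x in range(W):
            cy, cx = y * scale_y, x * scale_x  # position in low-res grid
            num, den = np.zeros(C), 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ly, lx = int(round(cy)) + dy, int(round(cx)) + dx
                    if not (0 <= ly < h and 0 <= lx < w):
                        continue
                    # spatial weight: Gaussian in low-res coordinates
                    ws = np.exp(-((ly - cy) ** 2 + (lx - cx) ** 2)
                                / (2 * sigma_spatial ** 2))
                    # range weight: guide similarity at the matching pixels
                    gy = min(int(ly / scale_y), H - 1)
                    gx = min(int(lx / scale_x), W - 1)
                    wr = np.exp(-((guide_hr[y, x] - guide_hr[gy, gx]) ** 2)
                                / (2 * sigma_range ** 2))
                    num += ws * wr * feat_lr[ly, lx]
                    den += ws * wr
            out[y, x] = num / max(den, 1e-8)
    return out
```

Upsample Anything replaces the fixed isotropic Gaussians above with a per-image anisotropic kernel whose parameters are optimized at test time.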

Why it matters?

This is important because it allows us to get high-resolution, detailed results from existing AI models without the cost of retraining or slow processing. It works well across different types of AI models and different kinds of image tasks, like identifying objects, estimating depth, and creating probability maps, making it a versatile tool for improving image analysis.

Abstract

We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only approximately 0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. Project page: https://seominseok0429.github.io/Upsample-Anything/
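The abstract's "per-image optimization" can be sketched in miniature. The PyTorch snippet below (an illustrative assumption, not the paper's actual objective: kernel parameterization, loss, and hyperparameters are all simplified) fits the log-sigmas of a separable Gaussian kernel so that downsampling the smoothed high-resolution map reproduces the original low-resolution features:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A toy low-res "feature map" (1 channel, 4x4) standing in for ViT features
lr = torch.rand(1, 1, 4, 4)

# Learnable log-sigmas (sigma_y, sigma_x) of an anisotropic Gaussian kernel
log_sigma = torch.zeros(2, requires_grad=True)

def gaussian_kernel(log_sigma, radius=2):
    """Build a normalized (2*radius+1)^2 separable Gaussian kernel."""
    sy, sx = torch.exp(log_sigma[0]), torch.exp(log_sigma[1])
    coords = torch.arange(-radius, radius + 1, dtype=torch.float32)
    ky = torch.exp(-coords ** 2 / (2 * sy ** 2))
    kx = torch.exp(-coords ** 2 / (2 * sx ** 2))
    k = torch.outer(ky, kx)
    return k / k.sum()

opt = torch.optim.Adam([log_sigma], lr=0.1)
hr = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
for _ in range(50):
    k = gaussian_kernel(log_sigma)[None, None]  # shape (1, 1, 5, 5)
    smoothed = F.conv2d(hr, k, padding=2)
    # Consistency loss: downsampling the smoothed high-res map should
    # reproduce the original low-res features
    loss = F.mse_loss(F.avg_pool2d(smoothed, 4), lr)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only two scalars are optimized per image here (the real method learns richer per-pixel parameters plus range cues from the image), the loop converges in a fraction of a second, which is consistent with the sub-second runtime the abstract reports.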