In Pursuit of Pixel Supervision for Visual Pre-training

Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu

2025-12-18

Summary

This paper explores how well computers can learn to 'see' and understand images by simply looking at a huge number of pictures from the internet, without needing humans to label them.

What's the problem?

Traditionally, teaching computers to understand images requires someone to tell the computer what's *in* the image – like labeling a picture as 'cat' or 'dog'. This takes a lot of time and effort. The researchers wanted to see whether a computer could learn useful information about images just by looking at the pixels themselves, without needing those labels, and whether this approach could remain competitive with more recent latent-space methods such as DINOv3.

What's the solution?

The researchers created a new system called 'Pixio', an enhanced version of a classic design called a masked autoencoder. Think of it as a computer trying to compress and then reconstruct an image: by forcing itself to rebuild the picture, it learns important features. Pixio is trained to reconstruct images even when large parts of them are hidden, which makes it work harder to understand the whole scene. The researchers trained Pixio on 2 billion images crawled from the web, using a self-curation strategy that automatically filters out low-quality images with minimal human effort.
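The masking-and-reconstruction idea can be sketched in a few lines. This is a generic illustration of masked-autoencoder training, not the paper's actual code: the patch size, mask ratio, and helper names (`mask_patches`, `masked_recon_loss`) are illustrative assumptions, and a real model would replace the zero prediction with an encoder-decoder network.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.75):
    """Randomly hide a fraction of the patches; return the visible ones and the mask."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)   # True = hidden patch
    mask[keep_idx] = False
    return patches[keep_idx], mask

def masked_recon_loss(pred, target, mask):
    """Mean squared error computed only on the hidden patches."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return per_patch[mask].mean()

# Toy example: a 16-patch "image", each patch flattened to 48 pixel values.
patches = rng.normal(size=(16, 48))
visible, mask = mask_patches(patches, mask_ratio=0.75)
# A real model would encode `visible` and decode a prediction for every patch;
# here a zero prediction stands in, just to show where the loss is applied.
pred = np.zeros_like(patches)
loss = masked_recon_loss(pred, patches, mask)
print(visible.shape, int(mask.sum()))  # (4, 48) 12
```

With a 75% mask ratio, only 4 of the 16 patches are seen by the encoder, and the loss is scored only on the 12 hidden ones, so the model cannot succeed by trivially copying its input.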

Why it matters?

This work shows that learning directly from pixels is still a really good way to teach computers to understand images. Pixio performs as well as, or even better than, other advanced systems that require more complex training methods. This means we might not always need massive, human-labeled datasets to build powerful computer vision systems, which could make it easier and cheaper to develop AI that can 'see' the world around us.

Abstract

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.