Next-Embedding Prediction Makes Strong Vision Learners
Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu
2025-12-19
Summary
This paper explores a new way to teach computers to 'see' and understand images, inspired by how well similar techniques work with language. It focuses on building models that can predict what comes next in an image, rather than just labeling what's already there.
What's the problem?
Traditionally, teaching computers to understand images requires either a lot of labeled data or complex self-supervised methods, such as reconstructing the original image from masked pieces or comparing different augmented views of the same image. These methods can be difficult to scale and often add design complexity (decoders, tokenizers, carefully tuned augmentations) without necessarily capturing the information that matters most for understanding the image's content. The goal is to find a simpler, more effective way to learn from unlabeled images.
What's the solution?
The researchers developed a method called Next-Embedding Predictive Autoregression, or NEPA. Essentially, they trained a model to predict the 'embedding' – a numerical representation – of the next patch of an image, given the patches that came before it. Think of it like predicting the next word in a sentence. The model is a standard 'Transformer', similar to those used in language models, and this prediction is its only training objective: there is no need to rebuild the image pixel by pixel or to compare different views of it. 'Causal masking' ensures the model only uses past patches when making each prediction, and a 'stop-gradient' keeps the training signal from flowing back into the prediction targets.
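To make the idea concrete, here is a minimal PyTorch sketch of next-embedding prediction with causal masking and a stop-gradient. It is an illustration under assumptions, not the authors' implementation: the model sizes, the linear prediction head, the use of the model's own patch embeddings as targets, and the cosine-similarity loss are placeholders that the paper may handle differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    """Sketch of next-embedding prediction: a causal Transformer over patch
    embeddings, trained to predict each patch's embedding from previous ones."""

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Patch embedding via a strided convolution, as in ViT.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Hypothetical prediction head mapping hidden states to embedding targets.
        self.pred_head = nn.Linear(dim, dim)

    def forward(self, images):
        # (B, 3, H, W) -> (B, N, dim): patches flattened in raster order.
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        targets = x.detach()            # stop-gradient: no signal flows into the targets
        x = x + self.pos_embed
        # Causal mask: position i may only attend to positions <= i.
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, device=x.device, dtype=torch.bool), diagonal=1)
        h = self.blocks(x, mask=causal)
        preds = self.pred_head(h[:, :-1])   # predict embeddings of patches 1..N-1
        # Negative cosine similarity as the loss (an assumption for this sketch).
        loss = 1 - F.cosine_similarity(preds, targets[:, 1:], dim=-1).mean()
        return loss
```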
Why it matters?
This work shows that you can achieve strong image understanding capabilities with a surprisingly simple approach. By focusing on predicting what comes next, the model learns useful representations of images without needing complex training procedures or large amounts of labeled data. This could lead to more efficient and versatile computer vision systems that can be applied to various tasks, and potentially even extended to other types of data like video or audio.
Abstract
Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next embedding prediction as its sole learning objective is effective: no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
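As a rough illustration of how such a pretrained model could be fine-tuned for classification, the sketch below reuses the hypothetical NextEmbeddingPredictor class from the earlier sketch and attaches a linear classifier. The pooling strategy, the retained causal mask, and the checkpoint path are all assumptions; the abstract reports the fine-tuned accuracy but not the recipe.

```python
import torch
import torch.nn as nn

# Hypothetical fine-tuning sketch: reuse the pretrained causal Transformer as a
# feature extractor and attach a linear classification head.
backbone = NextEmbeddingPredictor(dim=768, depth=12, heads=12)   # ViT-B-sized, from the sketch above
# backbone.load_state_dict(torch.load("nepa_pretrained.pt"))     # hypothetical checkpoint path
classifier = nn.Linear(768, 1000)                                # ImageNet-1K classes

def classify(images):
    x = backbone.patch_embed(images).flatten(2).transpose(1, 2) + backbone.pos_embed
    n = x.size(1)
    causal = torch.triu(torch.ones(n, n, device=x.device, dtype=torch.bool), diagonal=1)
    h = backbone.blocks(x, mask=causal)      # causal attention kept here; an assumption
    return classifier(h.mean(dim=1))         # mean-pool patch states, then classify

logits = classify(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 1000])
```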