
Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Jinghan Li, Yang Jin, Hao Jiang, Yadong Mu, Yang Song, Kun Xu

2025-12-25

Summary

This paper introduces a new way to train computers to understand and generate videos, building on the same next-token prediction idea that has worked so well for language models like GPT.

What's the problem?

Current methods for teaching computers to understand videos often miss important details about how things change over time. Many rely on masked modeling: hiding parts of a video and asking the computer to guess what's missing, an approach that doesn't fully capture the order of events. The few existing attempts at predicting video autoregressively (one piece after another) have struggled too, both at pinpointing the meaningful content in a scene and at generating it clearly and realistically.

What's the solution?

The researchers developed a system called NExT-Vid. It shows the computer a portion of a video and asks it to predict the *next* frame, so the model learns to anticipate what happens next. They improved this process in two key ways: first, a 'context-isolated' predictor separates the model's understanding of the video's content from the actual job of decoding the next frame; second, a conditioned 'flow-matching' decoder makes the generated frames more realistic and varied.
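To make the "predict the next frame from context only" idea concrete, here is a minimal PyTorch sketch. The class names, tensor sizes, and the plain regression loss (used in place of the paper's flow-matching decoder) are illustrative assumptions, not the NExT-Vid implementation.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes the observed frames into semantic tokens (one token per frame here)."""
    def __init__(self, frame_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) -> (batch, time, hidden_dim)
        return self.encoder(self.proj(frames))

class NextFramePredictor(nn.Module):
    """Predicts the next frame from context tokens only -- it never sees the target frame."""
    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, target_dim),
        )

    def forward(self, context_tokens: torch.Tensor) -> torch.Tensor:
        # Use the last context token to predict the next frame.
        # (A full autoregressive setup would predict every position with a causal mask.)
        return self.head(context_tokens[:, -1])

# Toy usage: 8 context frames of flattened 32x32x3 pixels, predict frame 9.
frame_dim, hidden_dim = 32 * 32 * 3, 256
encoder = ContextEncoder(frame_dim, hidden_dim)
predictor = NextFramePredictor(hidden_dim, frame_dim)

video = torch.randn(2, 9, frame_dim)            # (batch, time, pixels)
context, target = video[:, :8], video[:, 8]     # past frames vs. next frame
pred = predictor(encoder(context))              # (batch, frame_dim)
loss = nn.functional.mse_loss(pred, target)     # simple regression stand-in
loss.backward()
```

The design point mirrored here is the separation of roles: the encoder only summarizes the observed frames, and the predictor only sees those summaries, never the raw target frame.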

Why it matters?

This research is important because it leads to stronger visual representations and, with them, better computer vision systems. Improvements in how computers learn from and generate videos can feed into areas like self-driving cars, video editing software, and realistic virtual worlds. In the paper's experiments, the new method consistently beats previous generative pretraining approaches at understanding visual content, as measured by attentive probing on downstream classification tasks.

Abstract

Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.
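The abstract's "conditioned flow-matching decoder" can be illustrated with a generic flow-matching training step: a small network learns a velocity field that transports random noise toward the target frame, conditioned on the predictor's output. The network, dimensions, and linear interpolation path below follow the standard rectified flow-matching recipe and are assumptions for illustration, not the paper's exact decoder.

```python
import torch
import torch.nn as nn

class VelocityDecoder(nn.Module):
    """Predicts the velocity that moves a noisy frame toward the real one, given conditioning."""
    def __init__(self, frame_dim: int, cond_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + cond_dim + 1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, frame_dim),
        )

    def forward(self, x_t, cond, t):
        # x_t: partially noised frame, cond: predictor output, t: time in [0, 1]
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(decoder, target_frame, cond):
    noise = torch.randn_like(target_frame)
    t = torch.rand(target_frame.size(0), 1)        # one random time per sample
    x_t = (1 - t) * noise + t * target_frame       # linear path from noise to data
    velocity_target = target_frame - noise         # constant velocity along that path
    velocity_pred = decoder(x_t, cond, t)
    return nn.functional.mse_loss(velocity_pred, velocity_target)

# Toy usage with random tensors (dimensions are illustrative).
frame_dim, cond_dim = 3072, 256
decoder = VelocityDecoder(frame_dim, cond_dim)
target = torch.randn(2, frame_dim)                 # "next frame" to reconstruct
cond = torch.randn(2, cond_dim)                    # stand-in for the predictor's context features
loss = flow_matching_loss(decoder, target, cond)
loss.backward()
```

At sampling time, one would integrate the learned velocity field from noise toward a frame (for example with a few Euler steps), conditioned on the same context features.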