
Pretraining Frame Preservation in Autoregressive Video Memory Compression

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala

2026-01-01


Summary

This paper introduces PFP, a neural network designed to efficiently compress long videos into much shorter representations, called contexts, while still keeping the important visual details.

What's the problem?

Long videos contain a lot of information, and processing all of it can be computationally expensive and require a lot of memory. Existing methods struggle to compress videos without losing crucial details, especially the fine textures and sharp edges that make images look realistic. The challenge is to create a short 'context' representing the video that allows you to reconstruct or understand individual frames without needing to store the entire video.

What's the solution?

The researchers created PFP, which is first pretrained to preserve the high-frequency details (the fine visual elements) of individual frames taken from anywhere in the video. This pretraining lets PFP compress a 20-second video into a context roughly 5,000 tokens long, from which any chosen frame can be recreated with good visual quality. The pretrained model can then be fine-tuned to serve as a memory encoder for autoregressive video models, which generate video step by step and need to remember what happened earlier.
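To make the idea concrete, here is a minimal PyTorch sketch of the frame-preservation pretraining setup. This is not the authors' code: `VideoContextEncoder`, `FrameDecoder`, the toy per-frame tokenizer, the gradient-based detail term, and the 64x64 frame size are all illustrative assumptions; the actual architecture, context length (~5k), and detail-preserving objective are described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoContextEncoder(nn.Module):
    """Compresses a video (B, T, C, H, W) into a short, fixed-length context (B, L, D)."""

    def __init__(self, context_len=512, dim=256, num_heads=8):
        super().__init__()
        # Toy per-frame tokenizer: strided conv, then pooling to 4x4 tokens per frame.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8),
            nn.AdaptiveAvgPool2d(4),
        )
        # Learned queries fix the context length regardless of how long the video is.
        self.queries = nn.Parameter(torch.randn(context_len, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video):
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.flatten(0, 1))       # (B*T, D, 4, 4)
        tokens = feats.flatten(2).transpose(1, 2)             # (B*T, 16, D)
        tokens = tokens.reshape(b, t * 16, -1)                # (B, T*16, D)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)       # (B, L, D)
        ctx, _ = self.attn(q, tokens, tokens)                 # cross-attend video into the queries
        return ctx


class FrameDecoder(nn.Module):
    """Reconstructs the frame at a given (normalized) temporal position from the context."""

    def __init__(self, dim=256, num_heads=8, h=64, w=64):
        super().__init__()
        self.h, self.w = h, w
        self.time_embed = nn.Linear(1, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_pixels = nn.Linear(dim, 3 * h * w)

    def forward(self, ctx, t_norm):                           # t_norm in [0, 1], shape (B,)
        q = self.time_embed(t_norm.unsqueeze(-1)).unsqueeze(1)  # (B, 1, D)
        out, _ = self.attn(q, ctx, ctx)
        return self.to_pixels(out).view(-1, 3, self.h, self.w)


def pretrain_step(encoder, decoder, video, optimizer):
    """One pretraining step: recover a randomly chosen frame from the compressed context."""
    b, t = video.shape[:2]
    idx = torch.randint(0, t, (b,))                           # arbitrary temporal positions
    target = video[torch.arange(b), idx]                      # (B, 3, H, W)
    ctx = encoder(video)
    recon = decoder(ctx, idx.float() / max(t - 1, 1))
    # Pixel loss plus a simple image-gradient term as a stand-in for whatever
    # high-frequency-preserving objective the paper actually uses.
    loss = F.l1_loss(recon, target) + F.l1_loss(
        recon[..., :, 1:] - recon[..., :, :-1],
        target[..., :, 1:] - target[..., :, :-1],
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example usage with random data (batch of 2 clips, 16 frames of 64x64 each):
encoder, decoder = VideoContextEncoder(), FrameDecoder()
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
loss = pretrain_step(encoder, decoder, torch.rand(2, 16, 3, 64, 64), opt)
```

The key point the sketch tries to capture is that the frame index is sampled at random, so the compressed context is forced to retain enough detail to reconstruct any frame, not just the most recent ones.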

Why it matters?

This work is important because it offers a way to handle long videos more efficiently. By compressing videos while retaining important details, PFP can reduce the amount of computing power and memory needed for tasks like video editing, analysis, and creating AI systems that understand video content. It allows for longer 'memory' in video AI without a huge increase in processing costs, which is a significant step forward.

Abstract

We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
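As a rough illustration of the second stage described in the abstract, the sketch below shows how a pretrained context encoder could be plugged in as a memory module for an autoregressive video model. The `generate_next` conditioning interface is hypothetical, and re-encoding the full history at every step is a simplification for clarity; it is not the paper's actual fine-tuning procedure.

```python
import torch


def rollout_with_memory(memory_encoder, ar_model, past_frames, num_new_frames):
    """Generate frames while conditioning on a compressed memory of everything seen so far.

    memory_encoder: PFP-style encoder mapping (B, T, C, H, W) -> a short context (B, L, D)
    ar_model: autoregressive video model with an assumed `generate_next(memory)` method
    past_frames: (B, T0, C, H, W) history to start from
    """
    frames = list(past_frames.unbind(dim=1))                  # list of (B, C, H, W)
    t0 = past_frames.shape[1]
    for _ in range(num_new_frames):
        history = torch.stack(frames, dim=1)                  # full history so far
        memory = memory_encoder(history)                      # short context, e.g. ~5k for 20 s of video
        next_frame = ar_model.generate_next(memory)           # hypothetical conditioning interface
        frames.append(next_frame)
    return torch.stack(frames[t0:], dim=1)                    # only the newly generated frames
```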