Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
Lvmin Zhang, Maneesh Agrawala
2025-04-18
Summary
This paper introduces FramePack, a neural network method that helps AI models generate videos by packing and compressing input video frames, so the model can take in more context at once and predict future frames more accurately.
What's the problem?
When AI models generate videos, they need to look at many frames at once to understand what is happening and predict what comes next. However, memory and processing limits make it hard to keep track of all these frames, which restricts both the quality and the length of the videos the model can create.
What's the solution?
The researchers developed FramePack, which compresses past video frames so that more of them fit into the model's fixed memory budget. This lets the AI see a longer stretch of the video at once, use larger batches during training, and make better predictions for the next frames, leading to higher-quality video generation.
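The core idea can be sketched in a few lines of code. The snippet below is a minimal illustration, not the paper's actual architecture: the function name `pack_frames`, the parameter `base_tokens`, and the use of simple average pooling are all assumptions standing in for the learned compression the summary describes. The key property it demonstrates is that older frames keep geometrically fewer tokens, so the total packed context stays bounded no matter how many past frames are supplied.

```python
import numpy as np

def pack_frames(frames, base_tokens=64):
    """Pack past frames into a bounded context by compressing older
    frames more aggressively (a geometric schedule, as an illustration).

    frames: list of (tokens, dim) arrays, oldest first.
    Returns a single (total_tokens, dim) array, oldest frame first.
    """
    packed = []
    # Walk newest-to-oldest; a frame `age` steps back in time keeps
    # roughly base_tokens / 2**age tokens (at least 1).
    for age, frame in enumerate(reversed(frames)):
        keep = max(1, base_tokens // (2 ** age))
        # Average-pool groups of tokens down to `keep` tokens -- a crude
        # stand-in for the learned compression in the real model.
        groups = np.array_split(frame, keep, axis=0)
        pooled = np.stack([g.mean(axis=0) for g in groups])
        packed.append(pooled)
    # Restore temporal order: oldest (most compressed) first.
    return np.concatenate(packed[::-1], axis=0)
```

Because the per-frame token counts form a geometric series, the packed context length converges to about twice `base_tokens` regardless of how many past frames are given, which is why memory use stays roughly constant as the video grows longer.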
Why it matters?
This matters because it allows AI to create longer and more realistic videos, which is useful for everything from entertainment and animation to video editing and even scientific simulations.
Abstract
FramePack is a neural network structure for video generation that compresses input frames to keep the transformer context length bounded, enabling larger training batch sizes and more accurate next-frame prediction in video diffusion models.