Mixture of Contexts for Long Video Generation

Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein

2025-08-29

Summary

This paper tackles the challenge of creating long videos with AI, focusing on how the AI can 'remember' what happened earlier in the video to maintain consistency and coherence over extended periods.

What's the problem?

Generating long videos is difficult because AI models need to keep track of a lot of information over a long time. Current models, specifically those built on a technique called 'self-attention', struggle with this because the computational cost grows quadratically with video length: roughly doubling the length quadruples the work, making it slow and resource-intensive to process and remember everything. Essentially, the AI gets bogged down and forgets details as the video progresses.
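The quadratic cost is easy to see in a tiny sketch: the attention score matrix has one entry per pair of tokens, so its size is the sequence length squared (the function name and dimensions below are illustrative, not from the paper):

```python
import numpy as np

def attention_score_entries(seq_len, d=64):
    """Naive self-attention over seq_len tokens: the score matrix
    alone holds seq_len * seq_len entries, i.e. quadratic growth."""
    q = np.random.randn(seq_len, d)  # queries
    k = np.random.randn(seq_len, d)  # keys
    scores = q @ k.T                 # (seq_len, seq_len) score matrix
    return scores.size

# Doubling the sequence length quadruples the attention matrix.
assert attention_score_entries(200) == 4 * attention_score_entries(100)
```

For video, `seq_len` counts tokens across all frames, so minutes of footage pushes this matrix far beyond what is practical to compute and store.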

What's the solution?

The researchers propose a new method called 'Mixture of Contexts' (MoC). Instead of trying to remember everything all the time, MoC acts like a smart retrieval system. When the AI needs to understand a scene, it quickly finds and focuses on only the most relevant past events, along with mandatory anchors such as the video's caption and a local window of recent frames. This selective 'memory' prevents the AI from getting overwhelmed and allows it to efficiently process much longer videos. It also prevents the AI from getting stuck in loops: routing is causal, meaning each part of the video can only draw on earlier content, never on content that comes later.
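The routing idea can be illustrated with a toy sketch: split the video into chunks, and for each chunk pick the few most similar earlier chunks plus the mandatory anchors. This is a simplified illustration, not the paper's implementation; the mean-pooled chunk descriptors, `top_k` value, and anchor choices (caption chunk, previous chunk, self) are assumptions for the example.

```python
import numpy as np

def moc_routing(q_chunks, k_chunks, top_k=2, caption_idx=0):
    """Toy Mixture-of-Contexts routing.

    q_chunks, k_chunks: (num_chunks, d) arrays of per-chunk descriptors
    (e.g. mean-pooled queries/keys). For each chunk i, select the top_k
    most similar *earlier* chunks (causal: no looking ahead), and always
    include the anchors: the caption chunk, the previous chunk, and i.
    Returns a list of sorted index lists, one per chunk.
    """
    num_chunks = q_chunks.shape[0]
    routes = []
    for i in range(num_chunks):
        anchors = {caption_idx, max(i - 1, 0), i}
        # Similarity of chunk i's query to all strictly earlier keys.
        sims = q_chunks[i] @ k_chunks[:i].T if i > 0 else np.empty(0)
        ranked = np.argsort(sims)[::-1]          # most similar first
        extra = [int(j) for j in ranked if int(j) not in anchors][:top_k]
        routes.append(sorted(anchors | set(extra)))
    return routes
```

Because each chunk attends to a small, fixed number of others rather than to everything, the cost grows roughly linearly with video length, which is where the paper's near-linear scaling comes from.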

Why it matters?

This work is important because it makes it more practical to generate high-quality, long-form videos using AI. By reducing the computational burden, the researchers enable the creation of videos that are minutes long and maintain consistency in characters, scenes, and actions. This opens up possibilities for AI-generated movies, detailed simulations, and other applications requiring extended video content.

Abstract

Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.