
Context Forcing: Consistent Autoregressive Video Generation with Long Context

Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen

2026-02-06


Summary

This paper introduces Context Forcing, a new method for using artificial intelligence to generate long videos that stay realistic and consistent from start to finish.

What's the problem?

Current methods for generating long videos rely on a 'teacher' AI that only ever sees short clips (about five seconds) to guide a 'student' AI that is trying to create the whole video. This is a problem because the teacher has no access to the past, so it cannot tell whether the story stays coherent or the scene stays consistent over time. It's like trying to write a novel while someone gives you feedback on each sentence without knowing what came before.
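
To make that mismatch concrete, here is a minimal Python sketch. Everything in it (`student`, `teacher`, the window and rollout lengths) is a hypothetical illustration rather than the paper's actual code: the student generates a long rollout conditioned on its full history, but the memoryless teacher can only score short windows in isolation.

```python
# Illustrative sketch of the student-teacher mismatch (all names hypothetical).
FPS = 16
TEACHER_WINDOW = 5 * FPS   # teacher only ever sees ~5 seconds of frames
ROLLOUT_LEN = 120 * FPS    # student rolls out a much longer video

def train_step(student, teacher, prompt):
    frames = []
    for t in range(ROLLOUT_LEN):
        # The student generates autoregressively, conditioned on ALL past frames.
        frames.append(student.next_frame(prompt, history=frames))

    # The mismatch: supervision comes from a memoryless teacher that scores
    # each short window in isolation, so it can never penalize drift or
    # inconsistency relative to anything outside the current window.
    loss = 0.0
    for start in range(0, ROLLOUT_LEN, TEACHER_WINDOW):
        window = frames[start:start + TEACHER_WINDOW]
        loss += teacher.score(prompt, window)   # no access to frames[:start]
    return loss
```

Because no supervision signal ever depends on frames outside a single window, the student's effective context length is capped at the teacher's, no matter how long its rollouts are.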

What's the solution?

The researchers solved this by creating a 'teacher' AI that *can* see the entire video history while guiding the 'student'. To make this feasible for very long videos, they built a context-management system that keeps the most important information and strips out visual redundancy, so the teacher's memory does not grow without bound as the video gets longer. This system is called a Slow-Fast Memory architecture.
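
The summary does not spell out the exact compression mechanism, but a slow-fast memory of this kind can be pictured as a two-tier cache: recent frames are kept in full detail (fast memory), while frames that age out are compressed into a smaller long-term representation (slow memory). The sketch below is a hypothetical Python illustration of that idea under those assumptions, not the authors' implementation; the capacities, the compression ratio, and the keep-one-frame-per-group compression are all placeholders.

```python
# Hypothetical sketch of a slow-fast memory for video context management.
# Recent frames stay at full detail ("fast"); once they age out, they are
# compressed (here, crudely temporally downsampled) into the "slow" tier.
from collections import deque

class SlowFastMemory:
    def __init__(self, fast_capacity=80, compress_ratio=4):
        self.fast = deque()                  # full-detail recent frames
        self.slow = []                       # compressed long-term history
        self.fast_capacity = fast_capacity
        self.compress_ratio = compress_ratio
        self._staging = []                   # evicted frames awaiting compression

    def append(self, frame):
        self.fast.append(frame)
        if len(self.fast) > self.fast_capacity:
            self._staging.append(self.fast.popleft())
            if len(self._staging) == self.compress_ratio:
                # Placeholder compression: keep one frame per group. A real
                # system would merge tokens or pool features instead.
                self.slow.append(self._staging[0])
                self._staging = []

    def context(self):
        # Bounded, chronologically ordered context: compressed old history,
        # then not-yet-compressed middle frames, then full-detail recent frames.
        return self.slow + self._staging + list(self.fast)
```

Under these assumptions, the context the teacher attends over after n frames grows roughly as fast_capacity + n / compress_ratio instead of linearly in n, which is what makes a full-history teacher affordable for rollouts on the order of two minutes.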

Why does it matter?

This work is important because it allows AI to generate much longer and more coherent videos than previously possible, extending the effective context length two to ten times over existing techniques. This is a big step towards AI that can produce realistic and engaging long-form content, like movies or extended stories.

Abstract

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds -- 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.