PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
2026-03-30
Summary
This paper introduces a technique called PackForcing that makes AI generation of long videos faster, more memory-efficient, and higher quality.
What's the problem?
Creating long videos with AI is really hard because the computer needs to remember everything that happened before to make the video consistent, and that memory requirement grows massively as the video gets longer. This leads to slow generation, repetition of scenes, and errors building up over time. Existing methods struggle to balance quality with the amount of computer memory needed.
What's the solution?
PackForcing solves this by smartly organizing the 'memory' the AI uses. It divides the past video information into three types: important 'anchor' frames kept in full detail, a highly compressed version of the middle parts of the video, and recent frames kept in full detail for smooth transitions. It also selectively remembers only the most important parts of the compressed middle section and adjusts for any information lost during compression. This allows the AI to generate a 2-minute video on a single powerful graphics card without running out of memory.
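The three-way split described above can be sketched in plain Python. This is a hypothetical illustration, not the paper's implementation: the real mid-section compressor is a dual-branch network of 3D convolutions and low-resolution VAE re-encoding, which the simple average pooling below only stands in for, and all names (`build_context`, `n_sink`, `n_recent`) are invented for the sketch.

```python
import numpy as np

def build_context(frames, n_sink=2, n_recent=4, compress=32):
    """Partition historical frames into sink / mid / recent token groups.

    frames: list of (tokens_per_frame, dim) arrays, oldest first.
    Sink and recent frames keep full resolution; mid frames are
    compressed ~32x by average pooling (a crude stand-in for the
    paper's dual-branch 3D-conv + low-res VAE compressor).
    """
    sink = frames[:n_sink]                       # early anchor frames, full detail
    recent = frames[-n_recent:]                  # latest frames, full detail
    mid = frames[n_sink:len(frames) - n_recent]  # everything in between

    compressed_mid = []
    for f in mid:
        t, d = f.shape
        keep = max(1, t // compress)             # ~32x token reduction
        pooled = f[: keep * compress].reshape(keep, compress, d).mean(axis=1)
        compressed_mid.append(pooled)

    return sink, compressed_mid, recent

# Example: 20 past frames of 512 tokens each.
frames = [np.random.randn(512, 64) for _ in range(20)]
sink, mid, recent = build_context(frames)
full_tokens = 20 * 512
kept_tokens = sum(f.shape[0] for f in sink + mid + recent)
print(kept_tokens, "of", full_tokens)  # → 3296 of 10240
```

Only the middle partition shrinks, so the context stays bounded while the oldest anchors and the newest frames remain untouched.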
Why it matters?
This research is important because it makes long-form video generation with AI much more practical. It allows for the creation of high-quality, consistent videos with limited computing resources, and shows that you don't need to train the AI on extremely long videos to get good results. This opens the door to more accessible and powerful video creation tools.
Abstract
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis.
Code: https://github.com/ShandaAI/PackForcing
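The dynamic top-k selection and position re-alignment from the abstract can be illustrated with a small sketch. The relevance scoring below (dot product against the current query) and the function name are assumptions for illustration only, and the actual Temporal RoPE Adjustment rotates rotary embeddings; here we only show the contiguous re-indexing of the surviving positions that it implies.

```python
import numpy as np

def select_top_k(mid_tokens, mid_positions, query, k):
    """Keep the k mid tokens most relevant to the current query,
    then close the positional gaps left by the dropped tokens.

    mid_tokens: (n, dim) compressed history tokens
    mid_positions: (n,) original temporal positions
    query: (dim,) current decoding query (illustrative relevance signal)
    """
    scores = mid_tokens @ query               # dot-product relevance
    keep = np.sort(np.argsort(scores)[-k:])   # top-k indices, original order
    kept = mid_tokens[keep]
    # Re-alignment: survivors get contiguous positions starting at the
    # first kept position, removing the "gaps" that dropped tokens would
    # otherwise leave in the positional encoding (a simple analogue of
    # the paper's continuous Temporal RoPE Adjustment).
    new_positions = mid_positions[keep][0] + np.arange(k)
    return kept, new_positions

tokens = np.random.randn(100, 64)
positions = np.arange(100)
query = np.random.randn(64)
kept, pos = select_top_k(tokens, positions, query, k=16)
print(kept.shape, pos[:4])  # 16 tokens, gap-free positions
```

Because only k mid tokens survive each step, the cache size stays constant regardless of video length, which is what makes the bounded 4 GB footprint possible.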