Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
2025-12-05
Summary
This paper tackles the challenge of creating long, realistic videos with autoregressive video diffusion, a technique that generates videos frame by frame. The authors introduce a new method that improves the quality and consistency of these long videos without needing to retrain the existing models.
What's the problem?
While recent advances allow videos to be generated in real time, long generations often suffer from noticeable problems: repeating patterns, the content drifting away from the original scene, and motion that appears to slow down over time. Simply applying existing techniques designed for text generation (such as attention sinks) to video actually makes these problems *worse*, resulting in blurry or stagnant visuals.
What's the solution?
The researchers developed 'Deep Forcing,' which consists of two key ideas. First, 'Deep Sink' keeps a portion of the video's history 'locked' as persistent sink tokens and re-aligns their positional encoding (RoPE) to the current point in the timeline, helping to maintain a stable global context throughout the video. Second, 'Participative Compression' intelligently discards parts of the video's memory (the KV cache) that aren't actively contributing to recent frames, preventing errors from building up as the video gets longer. Importantly, neither of these techniques requires any additional training of the video generation model.
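To make the second idea concrete, here is a minimal sketch of importance-aware KV cache pruning in the spirit of Participative Compression. The scoring rule (summing the attention each cached token receives from recent queries) and all names are hypothetical illustrations, not the paper's exact algorithm:

```python
import numpy as np

def prune_kv_cache(keys, values, recent_attn, keep_ratio=0.5):
    """Illustrative importance-aware KV cache pruning.

    keys, values: (T, d) cached key/value tokens.
    recent_attn:  (Q, T) attention weights from the most recent
                  query frames over the T cached tokens.
    Tokens that received little recent attention are treated as
    "non-participating" and discarded.
    """
    # Importance of each cached token = total attention it received
    # from recent queries (hypothetical scoring rule).
    importance = recent_attn.sum(axis=0)            # shape (T,)
    k = max(1, int(keep_ratio * len(importance)))
    keep = np.sort(np.argsort(importance)[-k:])     # top-k, original order
    return keys[keep], values[keep], keep

# Toy usage: 8 cached tokens, only indices 1, 5, 6, 7 are attended to.
T, d, Q = 8, 4, 2
keys = np.arange(T * d, dtype=float).reshape(T, d)
values = keys.copy()
attn = np.zeros((Q, T))
attn[:, [1, 5, 6, 7]] = 1.0
k2, v2, kept = prune_kv_cache(keys, values, attn, keep_ratio=0.5)
# kept → [1, 5, 6, 7]; the unattended half of the cache is dropped.
```

In a streaming generator this would run periodically so the cache stays bounded while the tokens that still shape recent frames survive.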
Why it matters?
This work is significant because it demonstrates that much longer, higher-quality videos with better motion and consistency can be created *without* the expensive and time-consuming process of retraining the underlying model. The authors achieved impressive results, extending video generation to over twelve times its trained length (e.g., from 5-second clips to 60+ seconds) while maintaining or even improving visual quality, showing that clever memory management can be just as effective as more complex training-based methods.
Abstract
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
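The abstract's mention of re-aligning the sink tokens' "temporal RoPE phase" relies on a basic property of rotary position embeddings: rotations compose additively in position, so a cached key can be shifted to a new effective position without recomputing it from scratch. The sketch below illustrates that property; the function names and the specific re-alignment rule are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a rotary position embedding to vector x at position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def realign_sink(key_vec, old_pos, new_pos):
    """Shift an already-rotated sink key from its original position to one
    near the current timeline (hypothetical re-alignment rule)."""
    # RoPE rotations compose additively: rotating by (new_pos - old_pos)
    # on top of the old rotation yields the key as if encoded at new_pos.
    return rope_rotate(key_vec, new_pos - old_pos)

# Usage: a key cached at position 100 is re-aligned to position 5.
x = np.random.RandomState(0).randn(8)
k_old = rope_rotate(x, pos=100.0)
k_realigned = realign_sink(k_old, old_pos=100.0, new_pos=5.0)
# k_realigned matches rope_rotate(x, 5.0) up to floating-point error.
```

This is why the sink tokens can stay "locked" in the cache indefinitely: only their rotary phase needs updating, which is a cheap elementwise operation per rollout step.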