Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, Jun Zhang
2026-04-06
Summary
This paper focuses on making video generation models faster and more efficient, specifically reducing the computational cost needed to create each frame of a video while still maintaining high quality.
What's the problem?
Currently, speeding up video generation often leads to blurry or unnatural-looking videos. Existing methods either over-smooth the motion, making it look artificial, or struggle to maintain consistency over longer video sequences, causing the video to 'drift' and lose quality as it progresses. The core issue is that these methods don't effectively ensure that the small changes made to each frame build up correctly over time to create a realistic and stable video.
What's the solution?
The researchers developed a new technique called Self-Consistent Distribution Matching Distillation (SC-DMD), which ensures that the denoising updates applied at consecutive steps compose into the same final result, so each step builds stably on the last. They also rethought how the model uses its 'memory' (the KV cache) during generation: by treating the cache as a condition whose quality can vary, they train the model so that outputs produced from a lower-quality cache are steered toward high-quality references. The combined approach, dubbed Salt, keeps video generation sharp and consistent even when each frame is produced in just a few steps.
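The self-consistency idea can be illustrated with a toy numerical sketch: run a composed two-step rollout (t → s → r) and a direct one-step jump (t → r), and penalize any gap between the two endpoints. Everything below is an illustrative assumption, not the paper's actual objective: the linear `denoise_step`, the `drift` constant, and the squared-error loss are stand-ins for a real neural denoiser and training signal.

```python
import numpy as np

def denoise_step(x, t_from, t_to, drift=0.9):
    # Toy stand-in for a one-step generator update x_{t_from} -> x_{t_to}.
    # A real distilled model would be a neural network; `drift` is arbitrary.
    return x * (1.0 - drift * (t_from - t_to))

def self_consistency_loss(x_t, t, s, r):
    """Penalize disagreement between a composed two-step rollout
    (t -> s -> r) and a direct one-step jump (t -> r)."""
    composed = denoise_step(denoise_step(x_t, t, s), s, r)
    direct = denoise_step(x_t, t, r)
    return float(np.mean((composed - direct) ** 2))
```

Because this toy denoiser is not endpoint-consistent, the loss is positive for distinct timesteps; training would push it toward zero, which is the stability property the summary describes.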
Why it matters?
This work is important because it allows for real-time video generation, meaning videos can be created quickly enough to be used in applications like video calls, live streaming, or interactive experiences. By making these models more efficient, it opens up possibilities for creating high-quality video content on devices with limited computing power, like smartphones or laptops.
Abstract
Distilling video generation models to extremely low inference budgets (e.g., 2--4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.
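The cache-conditioned feature alignment objective can likewise be sketched with a toy example: compute features for the same frame under a high-quality and a degraded KV cache, treat the high-quality features as a fixed target, and penalize the distance between the two. The `features` function, the scalar `cache_quality` in [0, 1], and the MSE form are all hypothetical stand-ins; the paper's actual objective operates on network activations conditioned on real caches.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(frame, cache_quality):
    # Hypothetical feature extractor whose output degrades as the
    # KV-cache quality drops (stand-in for a real network's activations).
    noise = rng.normal(scale=1.0 - cache_quality, size=frame.shape)
    return frame + noise

def cache_alignment_loss(frame):
    """Align features computed under a degraded cache with those computed
    under a high-quality cache, treating the latter as a fixed target
    (in a real setup the target branch would be stop-gradient/detached)."""
    feat_high = features(frame, cache_quality=1.0)  # reference path
    feat_low = features(frame, cache_quality=0.3)   # student path
    return float(np.mean((feat_low - feat_high) ** 2))
```

Minimizing such a loss pulls the low-quality-cache branch toward the high-quality reference, which is the "steering" behavior the abstract describes.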