
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang

2025-12-05


Summary

This paper focuses on making it possible to generate long, realistic videos quickly, a key need for things like virtual reality and interactive simulations.

What's the problem?

Current methods for fast streaming video generation anchor the whole video to its first few frames, reusing them as fixed "sink" tokens throughout generation. While this anchoring keeps the video consistent and limits error accumulation, later frames become overly dependent on those static tokens: the model tends to copy the initial images or produce very little motion, so the generated videos look unnatural and static.

What's the solution?

The researchers introduce a framework called 'Reward Forcing' with two main ideas. First, 'EMA-Sink' keeps a fixed-size memory that starts from the initial frames but does not stay frozen: as frames leave the sliding attention window, their tokens are blended into the memory with an exponential moving average. The memory therefore carries both long-term context and recent dynamics, which prevents the model from getting stuck copying the first frames. Second, 'Re-DMD' (Rewarded Distribution Matching Distillation) changes how the fast student model is trained. Instead of treating every training sample equally, it uses a vision-language model to score how dynamic each sample is, and weights training toward the high-scoring ones, so the student learns to prioritize motion without sacrificing visual fidelity.
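The EMA-Sink update can be illustrated with a small sketch. This is a simplified toy version, not the paper's implementation: tokens are plain float vectors, the sliding window is a Python list, and the `decay` hyperparameter is a hypothetical choice.

```python
def ema_sink_update(sink, evicted, decay=0.9):
    # Blend a token evicted from the sliding window into the
    # fixed-size sink via an exponential moving average.
    return [decay * s + (1.0 - decay) * e for s, e in zip(sink, evicted)]

def stream_generate(frames, window=3, decay=0.9):
    # Sink memory is initialized from the first frame's tokens,
    # then continuously updated as old frames exit the window.
    sink = list(frames[0])
    buffer = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) > window:
            evicted = buffer.pop(0)           # oldest frame leaves the window
            sink = ema_sink_update(sink, evicted, decay)
    return sink
```

The key point the sketch shows: the sink starts as the initial frame but drifts toward recent content as generation proceeds, rather than pinning attention to a static copy of the opening frames.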

Why it matters?

This work matters because it enables high-quality, dynamic video generation much faster than before, streaming at 23.1 frames per second on a single H100 GPU. That speed opens the door to more realistic and interactive virtual worlds and simulations, and could also improve video editing and creation tools.

Abstract

Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.