Streaming Autoregressive Video Generation via Diagonal Distillation
Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu
2026-03-11
Summary
This paper focuses on making high-quality video generation faster, specifically for things like real-time streaming. It tackles the challenge of taking powerful but slow video creation models and making them efficient enough to run quickly without losing quality.
What's the problem?
Creating realistic videos currently demands a lot of computing power. Methods exist to speed things up, such as 'diffusion distillation,' but they often borrow techniques from image generation that don't fully account for how videos change over time. The result is videos that look good at first but become blurry or show unnatural motion as they grow longer, plus a persistent trade-off between speed and quality. The paper identifies two root causes: the distilled model does not use enough information about previous frames, and it makes incorrect implicit assumptions about future noise levels during video creation.
What's the solution?
The researchers developed a new technique called 'Diagonal Distillation,' which makes better use of information from previous frames and corrects the model's implicit predictions about future noise. It generates the video in chunks with an asymmetric schedule: more denoising steps at the beginning to build a detailed foundation, and fewer steps later on, since later chunks can inherit appearance detail from the thoroughly processed early ones. Partially denoised chunks also serve as conditional inputs for subsequent chunks, and an implicit optical-flow component helps keep motion realistic even with very few processing steps. In essence, the method builds a strong base, then efficiently adds detail while keeping the model's predictions consistent with how it is actually run at inference time.
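To make the "more steps early, fewer steps later" idea concrete, here is a minimal, hypothetical sketch of how a front-loaded per-chunk step budget could be computed. The function name, the linear weighting, and the rounding rule are illustrative assumptions, not the paper's actual schedule:

```python
# Hypothetical sketch of an asymmetric (front-loaded) denoising-step schedule.
# Earlier chunks get more steps; later chunks get fewer. The linear weights
# and rounding are assumptions for illustration, not the paper's exact design.
def diagonal_step_schedule(num_chunks, total_steps):
    """Distribute total_steps across chunks, favoring earlier chunks."""
    # Linearly decreasing weights: chunk 0 is weighted most heavily.
    weights = [num_chunks - i for i in range(num_chunks)]
    total_w = sum(weights)
    # Round each chunk's share, guaranteeing at least one step per chunk.
    return [max(1, round(total_steps * w / total_w)) for w in weights]

print(diagonal_step_schedule(4, 8))  # -> [3, 2, 2, 1]
```

The point of the sketch is only the shape of the budget: the first chunk is denoised most thoroughly, and each later chunk leans on the quality already established by its predecessors.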
Why it matters?
This work is important because it dramatically speeds up video generation without sacrificing quality: a 277.3x speedup over the original, undistilled model, generating a 5-second video in 2.61 seconds (up to 31 FPS). This opens the door to real-time applications of high-quality video generation, such as live streaming with advanced effects or interactive video experiences.
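A quick back-of-envelope check ties the reported numbers together. The 2.61 s generation time, 31 FPS peak, and 277.3x speedup come from the paper; the implied frame count and undistilled runtime below are derived for illustration, not stated in the paper:

```python
# Sanity-check the reported throughput figures (2.61 s, 31 FPS, 277.3x are
# from the paper; the derived quantities are back-of-envelope estimates).
gen_time_s = 2.61     # time to generate a 5-second clip
fps_peak = 31         # reported peak throughput
speedup = 277.3       # reported speedup over the undistilled model

frames = round(fps_peak * gen_time_s)   # implied frames in the clip
undistilled_s = gen_time_s * speedup    # implied undistilled generation time
print(frames, round(undistilled_s))     # -> 81 724
```

In other words, the clip works out to roughly 81 frames, and the undistilled model would need on the order of 12 minutes for the same 5-second video.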
Abstract
Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
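The abstract's key mechanism, using partially denoised chunks as conditional inputs so later chunks inherit appearance from earlier ones, can be sketched as a simple streaming loop. Everything here is an illustrative assumption: the function names, the `denoise(x, context)` signature, and the toy control flow stand in for the paper's actual model:

```python
# Hypothetical sketch of the streaming generation loop described in the
# abstract: each chunk is denoised for its allotted steps, then the partially
# denoised result conditions the next chunk. denoise(x, context) is an
# assumed stand-in for one model denoising step.
def stream_generate(chunks_noise, step_schedule, denoise):
    context = None   # no temporal context for the first chunk
    outputs = []
    for noise, n_steps in zip(chunks_noise, step_schedule):
        x = noise
        for _ in range(n_steps):
            x = denoise(x, context)   # one step, conditioned on prior chunks
        context = x                   # later chunks inherit this appearance
        outputs.append(x)
    return outputs
```

With a front-loaded step schedule, early chunks pass richly denoised context forward, which is what lets later chunks get away with fewer steps.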