Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu

2025-12-30

Summary

This paper introduces Stream-DiffVSR, a method for increasing the resolution of videos so they look sharper and more detailed. Its focus is speed: unlike previous diffusion-based approaches, which are far too slow, it runs quickly enough for real-time use.

What's the problem?

Existing diffusion-based methods for improving video resolution produce high-quality results but are too slow for uses like live streaming or video conferencing. They need to see future frames to work well, and the denoising process takes many steps, which together cause a long delay before the first frame appears. That makes them impractical whenever an immediate, high-quality output is needed.

What's the solution?

The researchers developed Stream-DiffVSR, which conditions only on frames it has already seen, so it never has to wait for future frames. They sped up denoising by distilling it down to just four steps. They also added a module that uses motion information from previous frames to guide the denoising, and a lightweight decoder that keeps the output detailed and smooth over time. Together, these changes streamline the entire pipeline so each frame can be enhanced as soon as it arrives.
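The causal, frame-by-frame loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `stream_vsr`, the list-based "encoder", the averaging "denoiser", and the value-duplicating 2x "decoder" are all hypothetical stand-ins chosen only to show the structure (process each frame with a fixed number of denoising steps, conditioning only on past results, and emit output immediately).

```python
from collections import deque

def stream_vsr(lr_frames, steps=4, context=1):
    """Causal streaming super-resolution loop (toy sketch).

    Each low-resolution frame is enhanced using only frames already
    seen, so an output is emitted per frame with no look-ahead delay.
    """
    past_latents = deque(maxlen=context)   # bounded memory of past frames
    outputs = []
    for lr in lr_frames:
        latent = list(lr)                  # toy "encoder": identity copy
        for _ in range(steps):             # four-step distilled "denoiser"
            if past_latents:
                # Toy conditioning: nudge toward the previous frame's latent
                # (a stand-in for motion-aligned guidance from past frames).
                prev = past_latents[-1]
                latent = [0.5 * (x + p) for x, p in zip(latent, prev)]
        past_latents.append(latent)
        # Toy 2x "decoder": duplicate each value to double the resolution.
        outputs.append([v for x in latent for v in (x, x)])
    return outputs

frames = [[1.0, 2.0], [3.0, 4.0]]
print(stream_vsr(frames))
```

The key property the sketch preserves is that the loop body touches only `past_latents`, never frames that have not arrived yet, which is what makes per-frame output (and hence low latency) possible.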

Why it matters?

This work matters because it is the first diffusion-based video super-resolution method fast enough for real-time use. It cuts the initial delay from over 4600 seconds to just 0.328 seconds on a powerful graphics card (an RTX 4090), while still delivering better perceptual quality than other fast methods. This opens the door to using these advanced techniques wherever speed is critical, such as live video calls or gaming.

Abstract

Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/
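The abstract's Auto-regressive Temporal Guidance (ARTG) injects motion-aligned cues from the previous frame during denoising. A minimal sketch of that idea, assuming a 1-D signal, integer per-position "flow" offsets, and a simple blend weight (all hypothetical; the names `warp` and `guided_denoise` are not from the paper):

```python
def warp(prev, flow):
    """Toy 1-D motion alignment: move each position by an integer
    offset from `flow`, clamped to the signal bounds. A stand-in for
    warping the previous frame's latent along optical flow."""
    n = len(prev)
    return [prev[min(max(i + f, 0), n - 1)] for i, f in enumerate(flow)]

def guided_denoise(latent, prev_latent, flow, weight=0.3):
    """One guided denoising step (toy): blend the current latent with
    the motion-aligned previous latent, so temporal information steers
    the update instead of each frame being denoised independently."""
    aligned = warp(prev_latent, flow)
    return [(1 - weight) * x + weight * a for x, a in zip(latent, aligned)]

print(guided_denoise([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1, 1, 1]))
```

The design point the sketch captures is that the guidance term is aligned to the current frame's geometry before blending; without the warp, reusing the previous latent directly would smear details wherever there is motion.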