
Generative Neural Video Compression via Video Diffusion Prior

Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma

2025-12-05


Summary

This paper introduces a new way to compress videos using artificial intelligence, specifically a system called GNVC-VD. It aims to make videos smaller in file size while still looking good, even at very low bitrates.

What's the problem?

Current video compression methods that use AI often treat each frame of a video separately. While they can add detail back into a compressed video, this can cause a flickering effect because the AI isn't considering how frames relate to each other over time. Essentially, the video looks choppy and unnatural, especially when heavily compressed.

What's the solution?

GNVC-VD solves this by using a powerful AI model designed for *generating* videos, not just restoring details. It doesn't look at each frame individually; it considers the entire sequence of frames together. The system starts with a compressed version of the video and then refines it, making adjustments that ensure consistency between frames and reduce flickering. It's like taking a slightly blurry video and intelligently smoothing out the transitions between frames.
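The idea of refining decoded latents (rather than generating from pure noise) with a sequence-level correction can be illustrated with a toy sketch. This is not the paper's implementation; `refine_latents` and `toy_correction` are hypothetical stand-ins, and the toy correction simply pulls every frame toward the sequence's temporal mean to mimic how a sequence-aware update suppresses frame-to-frame flicker:

```python
import numpy as np

def refine_latents(decoded_latents, correction_fn, num_steps=4):
    """Sketch of sequence-level refinement: start from the decoded
    spatio-temporal latents (not pure noise) and apply a learned
    correction term over a few flow-matching-style Euler steps."""
    z = decoded_latents.copy()
    for step in range(num_steps):
        t = 1.0 - step / num_steps  # pseudo-time, from degraded toward clean
        # The correction sees the WHOLE sequence at once, so its updates
        # are consistent across frames (this is what reduces flicker).
        v = correction_fn(z, t)
        z = z + v / num_steps  # one Euler step along the learned flow
    return z

def toy_correction(z, t):
    # Hypothetical correction: pull each frame toward the temporal mean.
    temporal_mean = z.mean(axis=0, keepdims=True)
    return t * (temporal_mean - z)

frames, h, w = 8, 4, 4
rng = np.random.default_rng(0)
decoded = rng.normal(size=(frames, h, w))  # stand-in for decoded latents
refined = refine_latents(decoded, toy_correction)

# Frame-to-frame variance shrinks: the refined sequence is temporally smoother.
print(refined.std(axis=0).mean() < decoded.std(axis=0).mean())  # True
```

In the real system the correction term comes from a video diffusion transformer conditioned on compression-aware cues; the point of the sketch is only the initialization-from-decoded-latents and the whole-sequence update.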

Why it matters?

This research is important because it shows a promising new direction for video compression. By integrating AI that understands how videos naturally flow, we can achieve much better quality at smaller file sizes. This is crucial for things like streaming videos online, storing large video libraries, and even future video technologies where high quality and efficient compression are essential.

Abstract

We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.