Uniform Discrete Diffusion with Metric Path for Video Generation

Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, Xinlong Wang

2025-10-29

Summary

This paper introduces a new method called URSA for generating videos, which builds a video step by step by iteratively refining a collection of discrete visual tokens.

What's the problem?

Generating videos from discrete parts, such as individual visual tokens, has been difficult: errors accumulate over time, and it is hard to keep a longer video consistent from start to finish. Existing discrete methods struggle to produce high-resolution, long videos efficiently, while the methods that do work well treat video as a continuous process, which is not always ideal.

What's the solution?

The researchers developed URSA, short for Uniform discRete diffuSion with metric pAth, which gradually refines discrete visual tokens over time. Two key ideas make this possible: a Linearized Metric Path, a way of measuring the 'distance' between different states of the video that simplifies the refinement process, and Resolution-dependent Timestep Shifting, a technique that adjusts the refinement schedule to the video's resolution. They also created an asynchronous fine-tuning strategy that trains a single model to handle multiple tasks, such as creating videos from images or filling in missing frames.
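To make the refinement idea concrete, here is a minimal toy sketch of iterative refinement under uniform discrete corruption. Everything here is an illustrative assumption, not URSA's actual algorithm: the vocabulary size, the schedule, and the "predictor" (an oracle that simply returns the target tokens, standing in for a learned network) are all made up so that the loop's mechanics can be shown in isolation.

```python
import random

rng = random.Random(0)
VOCAB = 16   # toy codebook size (assumed)
N = 32       # toy sequence length (assumed)
CLEAN = [rng.randrange(VOCAB) for _ in range(N)]  # stand-in "true" tokens

def toy_predictor(tokens):
    # Stand-in for the learned model: it always predicts the clean tokens.
    return list(CLEAN)

def refine(steps=8):
    # Start from pure uniform noise over the token vocabulary.
    x = [rng.randrange(VOCAB) for _ in range(N)]
    for i in range(steps):
        t = 1.0 - (i + 1) / steps  # remaining noise level after this step
        pred = toy_predictor(x)
        # Re-noise the prediction to level t: with probability t, each token
        # is replaced by a uniform random one (the "uniform" corruption).
        x = [p if rng.random() >= t else rng.randrange(VOCAB) for p in pred]
    return x

print(refine() == CLEAN)  # → True: at t = 0 every prediction is kept
```

The point of the loop is that every step refines the *whole* token grid at once, rather than generating tokens one at a time, which is what lets errors be corrected globally instead of accumulating.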

Why it matters?

URSA is important because it shows that we can create high-quality, long videos using a discrete approach, matching the performance of more complex continuous methods. This means we can potentially build more efficient and flexible video generation systems, and the code being made available allows others to build upon this work.

Abstract

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
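The abstract's Resolution-dependent Timestep Shifting can be illustrated with a small sketch. The paper does not give the formula here, so the specific shape below, a rational shift of the kind used in some continuous flow-matching samplers, and the square-root scaling rule for the shift factor, are assumptions chosen only to show the general idea: higher resolutions warp the sampling schedule so more steps are spent where refinement matters.

```python
import math

def shift_timestep(t, resolution, base_resolution=256):
    """Map a uniform timestep t in [0, 1] to a shifted timestep.

    Larger resolutions get a larger shift factor s, bending the schedule
    toward t = 1. Both the rational form and the sqrt scaling of s are
    illustrative assumptions, not URSA's published mechanism.
    """
    s = math.sqrt(resolution / base_resolution)  # assumed scaling rule
    return s * t / (1.0 + (s - 1.0) * t)

# At the base resolution (s = 1) the schedule is unchanged; at higher
# resolutions, midpoints shift while the endpoints 0 and 1 stay fixed.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, shift_timestep(t, 256), shift_timestep(t, 1024))
```

The fixed endpoints matter: shifting only reallocates the intermediate steps, so the process still runs from pure noise (t = 1) all the way to a clean sample (t = 0).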