
BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang

2025-12-03


Summary

This paper focuses on creating realistic, long videos – specifically, videos that are a full minute in length – as a way to help build more advanced AI systems that can understand and interact with the world. It introduces a new system called BlockVid to improve the quality and consistency of these generated videos.

What's the problem?

Generating long videos is hard because current methods struggle with two main issues. First, as the video gets longer, small errors build up over time, making the video look unrealistic. This happens because of how the AI 'remembers' previous parts of the video: cached information from earlier frames carries any mistakes forward into everything generated afterward. Second, there weren't good ways to actually *measure* how well a long video holds together and makes sense as a whole, nor good datasets to test these systems on.

What's the solution?

The researchers developed BlockVid, which tackles these problems in a few ways. It uses a smarter way to 'remember' past video frames, keeping only the most relevant parts of its memory to avoid error buildup. They also created a new training method, called Block Forcing, to help the AI learn to create consistent videos, and a new way to schedule and shuffle noise across chunks of the video during training to improve quality. Finally, they released a new dataset and set of tools to evaluate long videos, allowing for better comparison of different AI systems.
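The paper's exact selection rule for its semantic-aware sparse KV cache isn't spelled out here, but the general idea of that family of methods, scoring cached blocks of past-frame memory against the current query and keeping only the most relevant ones, can be sketched roughly as follows. Everything in this snippet (the function name, the mean-key block summary, the top-k scoring) is a hypothetical illustration, not BlockVid's actual implementation.

```python
import numpy as np

def sparse_kv_select(query, cached_keys, cached_values, k=4):
    """Keep only the k most relevant cached key/value blocks.

    Hypothetical sketch: each cached block is summarized by its
    mean key vector, scored by dot-product similarity with the
    current query, and only the top-k blocks are retained for
    attention, so stale, irrelevant memory is dropped.
    """
    # One summary vector per cached block: shape (num_blocks, dim)
    block_summaries = np.stack([kb.mean(axis=0) for kb in cached_keys])
    scores = block_summaries @ query       # relevance score per block
    top = np.argsort(scores)[-k:]          # indices of the k best blocks
    keep = sorted(top)                     # preserve temporal order
    return ([cached_keys[i] for i in keep],
            [cached_values[i] for i in keep])

# Toy usage: 8 cached blocks of 16 tokens with 32-dim keys/values
rng = np.random.default_rng(0)
keys = [rng.normal(size=(16, 32)) for _ in range(8)]
values = [rng.normal(size=(16, 32)) for _ in range(8)]
q = rng.normal(size=32)
sel_k, sel_v = sparse_kv_select(q, keys, values, k=4)
print(len(sel_k))  # 4 blocks survive the pruning
```

The payoff of any scheme like this is that attention cost and, more importantly, the opportunity for old errors to propagate both shrink with the cache, since blocks judged irrelevant to the current content never feed into new frames.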

Why it matters?

This work is important because creating realistic, long videos is a key step towards building AI that can truly understand and interact with the world around us. Better video generation leads to better AI simulators, which can be used for training robots, developing virtual reality experiences, and much more. BlockVid represents a significant improvement in the field, achieving noticeably better results than previous methods.

Abstract

Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.