
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei

2025-01-16


Summary

This paper introduces Ouroboros-Diffusion, a new way to make AI-generated videos that are longer and more consistent without retraining the underlying model. It's like teaching a computer to tell a never-ending story through video, where everything flows smoothly and makes sense from start to finish.

What's the problem?

Current approaches like FIFO-Diffusion can stretch a pre-trained text-to-video model into generating long videos from text descriptions, but the results tend to drift over time. It's like telling a story and slowly forgetting what the characters looked like or where they were supposed to be. This makes the videos look weird or choppy, especially in longer sequences.
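To see where the drift comes from, it helps to look at the mechanism the abstract spells out further down: FIFO-Diffusion keeps a queue of frames at progressively increasing noise levels, denoises every frame a little at each step, pops a clean frame off the head, and pushes fresh Gaussian noise onto the tail. Below is a minimal, hypothetical Python sketch of that loop; the denoiser, shapes, and noise schedule are placeholder assumptions, not the paper's code.

```python
# Minimal sketch of the FIFO denoising queue described in the abstract.
# denoise_step is a stand-in (hypothetical) for a pre-trained text-to-video
# diffusion model; real latents, schedules, and text conditioning are omitted.
from collections import deque
import numpy as np

QUEUE_LEN = 16             # frames held in the queue; the head is the cleanest
FRAME_SHAPE = (4, 32, 32)  # toy latent shape: channels, height, width

def denoise_step(latent, noise_level):
    """Placeholder for one reverse-diffusion step of a pre-trained model."""
    return latent * (1.0 - 0.1 * noise_level)  # not a real denoiser

# Initialize the queue with latents at progressively increasing noise levels.
queue = deque(
    np.random.randn(*FRAME_SHAPE) * (i + 1) / QUEUE_LEN
    for i in range(QUEUE_LEN)
)

clean_frames = []
for step in range(64):  # generate 64 output frames
    # Denoise every frame in the queue by one step; frames nearer the head
    # have been denoised more often and are therefore cleaner.
    queue = deque(
        denoise_step(latent, noise_level=(i + 1) / QUEUE_LEN)
        for i, latent in enumerate(queue)
    )
    # The head is now (approximately) clean: dequeue it as an output frame.
    clean_frames.append(queue.popleft())
    # Enqueue fresh Gaussian noise at the tail, as FIFO-Diffusion does.
    queue.append(np.random.randn(*FRAME_SHAPE))

print(len(clean_frames), clean_frames[0].shape)
```

Because each new tail frame starts from noise that knows nothing about the frames already in the queue, small inconsistencies can accumulate over a long generation.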

What's the solution?

The researchers created Ouroboros-Diffusion, which is like giving the AI a better memory and planning skills. It does three main things. First, it uses a smarter way to add new frames at the tail of the queue so they match the overall structure of the video. Second, it pays special attention to keeping characters and objects consistent from frame to frame, using what the paper calls Subject-Aware Cross-Frame Attention (SACFA). Finally, it uses information from the clearer, nearly finished frames at the front of the queue to guide the noisier frames at the back, kind of like using a good part of a photo to fix a smudged part; the paper calls this self-recurrent guidance.
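To make the three ideas concrete, here is a hypothetical sketch of where they could slot into the FIFO-style loop from the previous sketch. The low-frequency tail initialization, the guidance weighting, and all function names are illustrative assumptions rather than the authors' implementation, and the subject-aware attention is only marked by a comment because it lives inside the denoiser's attention layers (a separate sketch of that idea follows the abstract below).

```python
# Hypothetical sketch: where Ouroboros-Diffusion's three ideas could sit in a
# FIFO-style denoising loop. Names and formulas are illustrative assumptions.
import numpy as np

FRAME_SHAPE = (4, 32, 32)  # toy latent: channels, height, width

def denoise_step(latent, guidance=None):
    """Placeholder for one reverse-diffusion step of a pre-trained model.
    (2) SACFA would act inside this call, letting subject regions in a short
    window of frames attend to each other during denoising."""
    out = latent * 0.9  # not a real denoiser
    if guidance is not None:
        out = out + 0.05 * (guidance - out)  # nudge toward the cleaner context
    return out

def low_frequency(latent):
    """Crude blur standing in for 'keep the overall structure of the video'."""
    kernel = np.ones((3, 3)) / 9.0
    return np.stack([
        np.real(np.fft.ifft2(np.fft.fft2(c) * np.fft.fft2(kernel, s=c.shape)))
        for c in latent
    ])

queue = [np.random.randn(*FRAME_SHAPE) for _ in range(16)]
video = []
for step in range(64):
    # (3) Self-recurrent guidance: summarize the cleaner frames near the head
    # and let that summary steer the noisier frames near the tail.
    head_summary = np.mean(queue[:4], axis=0)
    queue = [
        denoise_step(latent, guidance=head_summary if i >= 8 else None)
        for i, latent in enumerate(queue)
    ]
    video.append(queue.pop(0))  # the cleanest frame leaves at the head
    # (1) Tail latent sampling: instead of pure Gaussian noise, start the new
    # frame from the low-frequency structure of the current last frame plus
    # fresh noise, so its layout matches the ongoing video.
    queue.append(low_frequency(queue[-1]) + 0.5 * np.random.randn(*FRAME_SHAPE))

print(len(video), video[0].shape)
```

The design intuition is the "Ouroboros" of the title: the clean end of the queue keeps feeding information back into the noisy end, so the video's structure and subjects are carried forward instead of being re-invented from scratch at every new frame.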

Why it matters?

This matters because as we use AI more for creating videos, we want them to look as natural and consistent as possible, especially for longer videos. Ouroboros-Diffusion could help make AI-generated videos for movies, video games, or educational content that are more enjoyable to watch and easier to understand. It's a step towards AI that can create more realistic and coherent visual stories, which could open up new possibilities in entertainment, education, and even how we communicate ideas through video.

Abstract

The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
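The Subject-Aware Cross-Frame Attention mentioned in the abstract can be pictured as ordinary scaled dot-product attention in which queries come from one frame while keys and values are gathered from the subject regions of the other frames in a short segment. The sketch below only illustrates that general idea with made-up shapes and random "subject masks"; the paper's actual attention layout, masking, and integration into the diffusion model are not shown here.

```python
# Hypothetical illustration of subject-aware cross-frame attention: queries
# from each frame attend to subject tokens pooled from a short segment.
# Masks, dimensions, and pooling are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_frames, tokens, dim = 4, 64, 32                 # a short segment of 4 frames
frames = rng.standard_normal((num_frames, tokens, dim))
subject_masks = rng.random((num_frames, tokens)) > 0.7  # fake subject regions

# Gather subject tokens from every frame in the segment as shared keys/values,
# so each frame can "look at" the subject as it appears in neighboring frames.
subject_tokens = np.concatenate(
    [frames[i][subject_masks[i]] for i in range(num_frames)], axis=0
)

def cross_frame_attention(query_frame, kv_tokens):
    scores = query_frame @ kv_tokens.T / np.sqrt(dim)  # (tokens, n_subject)
    return softmax(scores) @ kv_tokens                 # (tokens, dim)

aligned = np.stack([cross_frame_attention(f, subject_tokens) for f in frames])
print(aligned.shape)  # (4, 64, 32): each frame now mixes in shared subject features
```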