4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee

2024-12-06

Summary

This paper introduces 4Real-Video, a new system for generating realistic 4D videos, meaning videos that can be viewed from different viewpoints at different moments in time, making video content more dynamic and engaging.

What's the problem?

Creating high-quality videos that look realistic from multiple angles and stay consistent over time is challenging. Current methods often struggle to keep the different views synchronized and to make motion look natural, especially when the camera viewpoint changes while the scene is moving.

What's the solution?

The authors developed a framework called 4Real-Video, which organizes video frames into a grid where each row represents a moment in time and each column represents a viewpoint. They introduced a two-stream architecture: one stream updates the frames across viewpoints, while the other updates them across time. To keep these two streams working together, they added a synchronization layer after each transformer layer that lets the streams share information. This approach improves video quality, inference speed, and consistency compared to previous methods. A simplified sketch of the idea appears below.
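To make the two-stream idea more concrete, here is a minimal sketch, assuming a PyTorch-style implementation; the module and tensor names (TwoStreamBlock, view_tokens, time_tokens) and the linear-layer sync are hypothetical stand-ins, not the authors' actual code.

```python
# A minimal sketch (not the authors' code) of one two-stream block over a
# T x V grid of frame tokens. Names and the exact sync operation are assumptions.
import torch
import torch.nn as nn


class TwoStreamBlock(nn.Module):
    """One block: a view stream, a time stream, then a synchronization step."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sync = nn.Linear(2 * dim, dim)  # stand-in for the sync layer

    def forward(self, view_tokens, time_tokens):
        # tokens: (batch, T, V, N, dim) -- T timesteps, V viewpoints, N patches per frame
        b, t, v, n, d = view_tokens.shape

        # View stream: attend across viewpoints (and patches) at each fixed timestep.
        x = view_tokens.reshape(b * t, v * n, d)
        x, _ = self.view_attn(x, x, x)
        view_out = x.reshape(b, t, v, n, d)

        # Time stream: attend across timesteps (and patches) at each fixed viewpoint.
        y = time_tokens.permute(0, 2, 1, 3, 4).reshape(b * v, t * n, d)
        y, _ = self.time_attn(y, y, y)
        time_out = y.reshape(b, v, t, n, d).permute(0, 2, 1, 3, 4)

        # Synchronization: exchange information between the two token streams.
        fused = self.sync(torch.cat([view_out, time_out], dim=-1))
        return view_out + fused, time_out + fused
```

Splitting attention this way, across viewpoints in one stream and across time in the other, keeps each attention operation small relative to attending over the whole grid at once, which fits the paper's claim of higher inference speed for this feedforward design.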

Why it matters?

This research is important because it advances the technology for generating videos that are not only visually appealing but also flexible in terms of how they can be viewed. By improving how videos are created, 4Real-Video could enhance applications in entertainment, virtual reality, and education, making it easier to create immersive experiences that engage viewers.

Abstract

We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. We propose a novel two-stream architecture. One stream performs viewpoint updates on columns, and the other stream performs temporal updates on rows. After each diffusion transformer layer, a synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization. This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore and Dust3R-Confidence).
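The abstract names hard and soft variants of the synchronization layer without detailing them here, so the following is only one plausible reading with hypothetical names: hard synchronization forces both streams onto a single shared state, while soft synchronization blends the streams with a learned mixing weight.

```python
# A hedged sketch of the two synchronization variants named in the abstract.
# The exact formulations are not given in this summary; this is an assumed reading.
import torch
import torch.nn as nn


def hard_sync(view_tokens: torch.Tensor, time_tokens: torch.Tensor):
    # "Hard": both streams continue from the same shared (here, averaged) state.
    shared = 0.5 * (view_tokens + time_tokens)
    return shared, shared


class SoftSync(nn.Module):
    # "Soft": each stream keeps part of its own state and absorbs part of the other's.
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned mixing weight

    def forward(self, view_tokens: torch.Tensor, time_tokens: torch.Tensor):
        a = torch.sigmoid(self.alpha)
        new_view = a * view_tokens + (1 - a) * time_tokens
        new_time = a * time_tokens + (1 - a) * view_tokens
        return new_view, new_time
```

In either variant the goal is the same: after every block, information gathered along the viewpoint axis is visible to the time stream and vice versa, which is what keeps the generated grid consistent along both axes.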