Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, Hao Tang
2024-10-15

Summary
This paper introduces Cavia, a framework that turns an input image into videos along user-specified camera trajectories while keeping the generated content consistent across different viewpoints.
What's the problem?
While image-to-video generation has advanced rapidly, existing methods struggle with two main issues: maintaining a consistent 3D appearance across generated frames and giving users effective control over camera movement. Many current techniques can handle only simple camera paths, or fail to produce mutually consistent videos when the same scene is generated along different camera trajectories.
What's the solution?
Cavia addresses these problems with view-integrated attention: the spatial and temporal attention modules of a video diffusion model are extended so that attention also spans viewpoints, improving both cross-view and temporal consistency. Given an input image, the framework generates multiple videos of the same scene that remain consistent over time and across camera angles, as illustrated in the sketch below. Its design also allows joint training on diverse video data, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos.
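To make the mechanism concrete, below is a minimal sketch of how spatial and temporal attention can be "view-integrated" by folding tokens from all views into the attended sequence. The tensor layout, module structure, and folding strategy here are illustrative assumptions, not Cavia's published implementation.

```python
# Minimal sketch of view-integrated attention (assumed layout, not Cavia's code).
# Latent video tokens are kept as (batch, views, frames, tokens, dim); each mode
# folds a different pair of axes into the sequence that attention operates over.
import torch
import torch.nn as nn
from einops import rearrange

class ViewIntegratedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, mode: str) -> torch.Tensor:
        b, v, f, n, d = x.shape
        if mode == "cross_view_spatial":
            # Spatial attention extended across views: all spatial tokens of all
            # views at the same timestep attend to each other, encouraging
            # geometric consistency between viewpoints.
            x = rearrange(x, "b v f n d -> (b f) (v n) d")
        elif mode == "cross_frame_temporal":
            # Temporal attention extended across views: all frames of all views
            # at the same spatial location attend to each other, encouraging
            # coherent motion over time and across viewpoints.
            x = rearrange(x, "b v f n d -> (b n) (v f) d")
        else:
            raise ValueError(f"unknown mode: {mode}")
        out, _ = self.attn(x, x, x)
        if mode == "cross_view_spatial":
            return rearrange(out, "(b f) (v n) d -> b v f n d", b=b, v=v)
        return rearrange(out, "(b n) (v f) d -> b v f n d", b=b, v=v)

# Example: 2 views, 8 frames, 16 spatial tokens of width 64.
layer = ViewIntegratedAttention(dim=64)
tokens = torch.randn(1, 2, 8, 16, 64)
tokens = layer(tokens, mode="cross_view_spatial")
tokens = layer(tokens, mode="cross_frame_temporal")
```

Folding views into the attended sequence is what lets corresponding regions in different viewpoints exchange information directly; a strictly per-view attention layer, by contrast, has no path for enforcing cross-view consistency.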
Why it matters?
This research is important because it advances video generation toward outputs that remain realistic from multiple perspectives. By allowing precise control over camera movements while maintaining visual consistency, Cavia can be a valuable tool in fields like filmmaking, gaming, and virtual reality.
Abstract
In recent years, there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project Page: https://ir1d.github.io/Cavia/
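As a companion to the abstract's claim that users can precisely specify camera motion, the sketch below computes per-pixel Plücker ray embeddings from a frame's camera intrinsics and extrinsics, a common way to encode a camera trajectory as a dense conditioning signal for diffusion models. Treating this as Cavia's exact conditioning scheme is an assumption; the abstract does not spell out the encoding.

```python
# Sketch: encode one frame's camera as a (6, H, W) Plücker ray map.
# This parameterization is a common choice for camera-conditioned diffusion;
# whether Cavia uses exactly this encoding is an assumption.
import torch

def plucker_ray_embedding(K: torch.Tensor, c2w: torch.Tensor,
                          height: int, width: int) -> torch.Tensor:
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose."""
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    # Unproject pixel centers to camera-space ray directions.
    dirs_cam = torch.stack([
        (xs + 0.5 - K[0, 2]) / K[0, 0],
        (ys + 0.5 - K[1, 2]) / K[1, 1],
        torch.ones_like(xs),
    ], dim=-1)                                   # (H, W, 3)
    # Rotate into world space and normalize.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3]                          # ray origin shared by all pixels
    # Plücker coordinates: (moment, direction) with moment = origin x direction.
    moment = torch.cross(origin.expand_as(dirs_world), dirs_world, dim=-1)
    return torch.cat([moment, dirs_world], dim=-1).permute(2, 0, 1)
```

Stacking one such map per frame (and per view) yields a trajectory tensor that can be fed to the denoiser alongside the latent video tokens, giving it an unambiguous, per-pixel description of where each ray looks.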