Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
2025-01-08

Summary
This paper introduces a new AI system called Diffusion as Shader (DaS) that can create and control videos in many different ways, such as changing camera angles or editing content, all with a single tool.
What's the problem?
Current AI systems that make videos are good at creating them from text or images, but they're not great at controlling specific aspects of the video, like moving the camera or changing objects. Most existing tools can only do one type of control at a time, which limits their usefulness.
What's the solution?
The researchers created DaS, which uses '3D tracking videos' to control how videos are made. A 3D tracking video records how a set of 3D points in the scene moves over time, which helps DaS understand the three-dimensional structure of what's happening. This lets DaS handle many different kinds of video control with a single mechanism, like changing camera angles, transferring motion from one video to another, or manipulating objects in the video.
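To make the idea more concrete, here is a minimal, hypothetical sketch of how such a 3D tracking video could be rendered: a set of 3D points is tracked through the scene, each point gets a fixed color, and the points are projected into every frame with the camera parameters. The function name `render_tracking_video`, the array shapes, and the resolution are illustrative assumptions, not the authors' code.

```python
import numpy as np

def render_tracking_video(points_3d, colors, K, extrinsics, h=480, w=720):
    """Render a '3D tracking video': every tracked 3D point keeps one fixed
    color across all frames, so the projected dots encode which pixels
    correspond to the same 3D point over time.

    points_3d : (T, N, 3) per-frame 3D positions of N tracked points (world coords)
    colors    : (N, 3)    fixed RGB color per point, in [0, 1]
    K         : (3, 3)    camera intrinsics
    extrinsics: (T, 4, 4) world-to-camera transforms, one per frame
    """
    T, N, _ = points_3d.shape
    frames = np.zeros((T, h, w, 3), dtype=np.float32)
    for t in range(T):
        # Move the points into this frame's camera coordinates.
        pts_h = np.concatenate([points_3d[t], np.ones((N, 1))], axis=1)   # (N, 4)
        cam = (extrinsics[t] @ pts_h.T).T[:, :3]                          # (N, 3)
        in_front = cam[:, 2] > 1e-6
        # Pinhole projection; clip depth so points behind the camera don't divide by zero.
        z = np.clip(cam[:, 2:3], 1e-6, None)
        uv = (K @ cam.T).T[:, :2] / z
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # Splat each visible point with its fixed color (single-pixel dots for brevity).
        frames[t, v[valid], u[valid]] = colors[valid]
    return frames
```

Because every point keeps the same color from frame to frame, the rendered video carries explicit 3D correspondences across time; this is what makes the conditioning 3D-aware and helps the generated frames stay consistent.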
Why it matters?
This matters because it could make creating and editing videos much easier and more flexible. Filmmakers, game developers, and even regular people making videos for social media could use this technology to create complex, high-quality videos with less effort. It could lead to new ways of storytelling in movies and games, or help create more realistic virtual reality experiences. The fact that DaS can do so many different things with just one system also means it could save time and resources in video production.
Abstract
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
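As a hedged illustration of how several controls can fall out of one conditioning signal, the sketch below re-renders the same tracked 3D points under a new camera trajectory; feeding the re-rendered tracking video to the diffusion model would then steer the camera. The `orbit_extrinsics` helper, the look-at convention, and the reuse of `render_tracking_video` from the sketch above are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def orbit_extrinsics(num_frames, radius=3.0, height=0.5):
    """Hypothetical helper: world-to-camera poses (OpenCV convention:
    x right, y down, z forward) that sweep a quarter orbit around the
    scene origin while looking at it."""
    poses = []
    for t in range(num_frames):
        theta = 0.5 * np.pi * t / max(num_frames - 1, 1)       # 0 -> 90 degrees
        eye = np.array([radius * np.sin(theta), height, radius * np.cos(theta)])
        forward = -eye / np.linalg.norm(eye)                    # camera z: toward the origin
        right = np.cross(forward, np.array([0.0, 1.0, 0.0]))    # camera x
        right /= np.linalg.norm(right)
        down = np.cross(forward, right)                         # camera y
        R = np.stack([right, down, forward])                    # rows: camera axes in world coords
        E = np.eye(4)
        E[:3, :3] = R
        E[:3, 3] = -R @ eye                                     # world-to-camera translation
        poses.append(E)
    return np.stack(poses)

# Re-render the *same* tracked 3D points under this new trajectory, then feed the
# resulting tracking video to the diffusion model as its control signal, e.g.:
# new_control = render_tracking_video(points_3d, colors, K, orbit_extrinsics(len(points_3d)))
```

The other tasks named in the abstract would follow the same pattern: instead of changing the camera, edit the 3D tracks themselves (for example, move or animate a subset of points for object manipulation, or take the tracks from a reference video for motion transfer) and re-render the tracking video.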