Unlike traditional approaches that simply bolt temporal layers onto image diffusion models, VideoDiT takes a different approach to video encoding and generation. The framework uses a Distribution-Preserving VAE (DP-VAE) that encodes a video's key frames with the 2D VAE of the pre-trained image model, while non-key frames are compressed by a 3D VAE, keeping spatiotemporal modeling efficient. This combination allows knowledge from pre-trained image diffusion models to transfer seamlessly to video generation. 3D positional embeddings and the extension of the 2D attention mechanism into 3D space let VideoDiT model complex video dynamics with negligible additional computational overhead.
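
As a rough illustration of this split, the sketch below encodes key frames with a frozen 2D image VAE and the remaining frames with a 3D video VAE. The class name, the `key_stride` parameter, and the encoder interfaces are assumptions made for illustration, not VideoDiT's actual API.

```python
import torch
import torch.nn as nn

class DPVAEEncoder(nn.Module):
    """Hedged sketch of a distribution-preserving encoding split:
    key frames go through a pre-trained 2D (image) VAE encoder so their
    latent distribution matches the image model, while the remaining
    frames are compressed jointly by a 3D (video) VAE encoder."""

    def __init__(self, image_vae_2d: nn.Module, video_vae_3d: nn.Module, key_stride: int = 4):
        super().__init__()
        self.image_vae_2d = image_vae_2d   # frozen 2D VAE from the base image model (assumed)
        self.video_vae_3d = video_vae_3d   # 3D VAE for temporal compression (assumed)
        self.key_stride = key_stride       # every key_stride-th frame is treated as a key frame

    def forward(self, video: torch.Tensor):
        # video: (B, C, T, H, W)
        key = video[:, :, ::self.key_stride]            # key frames -> 2D VAE, frame by frame
        b, c, tk, h, w = key.shape
        key_lat = self.image_vae_2d(key.permute(0, 2, 1, 3, 4).reshape(b * tk, c, h, w))
        key_lat = key_lat.reshape(b, tk, *key_lat.shape[1:]).permute(0, 2, 1, 3, 4)

        # non-key frames keep their temporal axis and are compressed by the 3D VAE
        mask = torch.ones(video.shape[2], dtype=torch.bool)
        mask[::self.key_stride] = False
        nonkey_lat = self.video_vae_3d(video[:, :, mask])
        return key_lat, nonkey_lat
```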


VideoDiT supports joint image-video training, preserving the spatial modeling capabilities of the base image generation model while excelling in both static and dynamic content creation. This dual capability allows users to generate high-fidelity videos and images within a unified framework, streamlining workflows for applications such as content creation, animation, and synthetic data generation. Extensive experiments have validated VideoDiT's effectiveness, demonstrating its ability to produce high-quality, temporally consistent videos that maintain the detail and realism of their image-based counterparts.
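
One common way to realize joint image-video training of this kind is to treat images as single-frame clips so both modalities pass through the same 3D DiT backbone. The sketch below assumes a generic diffusion scheduler and model interface and is only an illustrative approximation, not VideoDiT's actual training code.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, diffusion, batch, optimizer):
    """Hedged sketch of one joint image-video training step.
    Function and argument names (add_noise, cond, text_emb) are assumptions."""
    x = batch["latents"]                 # (B, C, T, H, W); T == 1 for image samples
    t = torch.randint(0, diffusion.num_steps, (x.shape[0],), device=x.device)
    noise = torch.randn_like(x)
    noisy = diffusion.add_noise(x, noise, t)        # assumed scheduler interface

    pred = model(noisy, t, cond=batch["text_emb"])  # 3D DiT handles T=1 and T>1 uniformly
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```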


Key features include:


  • Integrates Distribution-Preserving VAE and 3D Diffusion Transformers
  • Enables efficient joint image-video training
  • Supports high-quality, temporally consistent video synthesis
  • Leverages pre-trained image diffusion models for video generation
  • Utilizes 3D positional embeddings for advanced spatiotemporal modeling
  • Adds video capability with minimal increase in model parameters
