Unlike traditional approaches that simply bolt temporal layers onto image diffusion models, VideoDiT takes a different approach to video encoding and generation. The framework uses a Distribution-Preserving VAE (DP-VAE) that encodes a video's key frames with the 2D VAE of the pre-trained image model, while non-key frames are compressed by a 3D VAE, keeping spatiotemporal modeling efficient. This combination allows knowledge from pre-trained image diffusion models to transfer seamlessly to video generation. 3D positional embeddings and the extension of the 2D attention mechanism into 3D space let VideoDiT model complex video dynamics with negligible additional computational overhead.
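
As a rough illustration of this split, the sketch below encodes key frames with a frozen 2D image VAE and the remaining frames with a 3D video VAE. The class name, the `key_stride` parameter, and the encoder interfaces are assumptions made for illustration, not VideoDiT's actual API.

```python
import torch
import torch.nn as nn

class DPVAEEncoder(nn.Module):
    """Hedged sketch of a distribution-preserving encoding split:
    key frames go through a pre-trained 2D (image) VAE encoder so their
    latent distribution matches the image model, while the remaining
    frames are compressed jointly by a 3D (video) VAE encoder."""

    def __init__(self, image_vae_2d: nn.Module, video_vae_3d: nn.Module, key_stride: int = 4):
        super().__init__()
        self.image_vae_2d = image_vae_2d   # frozen 2D VAE from the base image model (assumed)
        self.video_vae_3d = video_vae_3d   # 3D VAE for temporal compression (assumed)
        self.key_stride = key_stride       # every key_stride-th frame is treated as a key frame

    def forward(self, video: torch.Tensor):
        # video: (B, C, T, H, W)
        key = video[:, :, ::self.key_stride]            # key frames -> 2D VAE, frame by frame
        b, c, tk, h, w = key.shape
        key_lat = self.image_vae_2d(key.permute(0, 2, 1, 3, 4).reshape(b * tk, c, h, w))
        key_lat = key_lat.reshape(b, tk, *key_lat.shape[1:]).permute(0, 2, 1, 3, 4)

        # non-key frames keep their temporal axis and are compressed by the 3D VAE
        mask = torch.ones(video.shape[2], dtype=torch.bool)
        mask[::self.key_stride] = False
        nonkey_lat = self.video_vae_3d(video[:, :, mask])
        return key_lat, nonkey_lat
```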


VideoDiT supports joint image-video training, preserving the spatial modeling capabilities of the base image generation model while excelling in both static and dynamic content creation. This dual capability allows users to generate high-fidelity videos and images within a unified framework, streamlining workflows for applications such as content creation, animation, and synthetic data generation. Extensive experiments have validated VideoDiT's effectiveness, demonstrating its ability to produce high-quality, temporally consistent videos that maintain the detail and realism of their image-based counterparts.
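
One common way to realize joint image-video training of this kind is to treat images as single-frame clips so both modalities pass through the same 3D DiT backbone. The sketch below assumes a generic diffusion scheduler and model interface and is only an illustrative approximation, not VideoDiT's actual training code.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, diffusion, batch, optimizer):
    """Hedged sketch of one joint image-video training step.
    Function and argument names (add_noise, cond, text_emb) are assumptions."""
    x = batch["latents"]                 # (B, C, T, H, W); T == 1 for image samples
    t = torch.randint(0, diffusion.num_steps, (x.shape[0],), device=x.device)
    noise = torch.randn_like(x)
    noisy = diffusion.add_noise(x, noise, t)        # assumed scheduler interface

    pred = model(noisy, t, cond=batch["text_emb"])  # 3D DiT handles T=1 and T>1 uniformly
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```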


Key features include:


  • Integrates Distribution-Preserving VAE and 3D Diffusion Transformers
  • Enables efficient joint image-video training
  • Supports high-quality, temporally consistent video synthesis
  • Leverages pre-trained image diffusion models for video generation
  • Utilizes 3D positional embeddings for advanced spatiotemporal modeling
  • Adds video capability with minimal increase in model parameters
