Key Features

Lightweight, High-Performance Architecture
Video Super-Resolution Enhancement
End-to-End Training Optimization
Text-to-Video and Image-to-Video Generation
Runs Smoothly on Consumer-Grade GPUs
Efficient Architecture with 8.3B-Parameter Diffusion Transformer
Innovative SSTA Mechanism for Reduced Computational Overhead
Multi-Stage Progressive Training Strategy

The model proposes an efficient architecture that integrates an 8.3B-parameter Diffusion Transformer (DiT) with a 3D causal VAE, achieving compression ratios of 16× in the spatial dimensions and 4× along the temporal axis. Additionally, the innovative SSTA (Selective and Sliding Tile Attention) mechanism prunes redundant spatiotemporal key-value (KV) blocks, significantly reducing computational overhead for long video sequences and accelerating inference. The model also includes an efficient few-step super-resolution network that upscales outputs to 1080p, enhancing sharpness while correcting distortions.
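To make the stated compression ratios concrete, here is a minimal sketch of the latent-grid size they imply for a sample clip. It assumes simple ceiling division; the real 3D causal VAE's rounding and first-frame handling are not specified in this description, so treat the exact numbers as illustrative.

```python
import math

# Compression ratios stated for the 3D causal VAE:
SPATIAL_RATIO = 16   # applied to height and width
TEMPORAL_RATIO = 4   # applied to the frame axis

def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Approximate (t, h, w) of the latent grid after VAE compression.

    Assumes plain ceiling division; the actual model's rounding rules
    are an implementation detail not given in the source text.
    """
    return (
        math.ceil(frames / TEMPORAL_RATIO),
        math.ceil(height / SPATIAL_RATIO),
        math.ceil(width / SPATIAL_RATIO),
    )

# Example: a 5-second, 24 fps, 720p clip.
t, h, w = latent_shape(frames=120, height=720, width=1280)
print(t, h, w)        # 30 45 80
print(t * h * w)      # 108000 latent positions the DiT must attend over
```

Even after 16×/4× compression, a short clip yields on the order of 10^5 latent positions, which is why pruning redundant KV blocks with a mechanism like SSTA matters for long sequences.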


The model employs a multi-stage, progressive training strategy covering the entire pipeline from pre-training to post-training. Combined with the Muon optimizer to accelerate convergence, this approach holistically refines motion coherence, aesthetic quality, and human preference alignment, achieving professional-grade content generation. The model provides a unified framework for high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact yet capable model establishes a new state of the art among open-source models.
