Key Features

Generates 720p minute-scale video from one image and a camera trajectory.
Uses a 2.6B-parameter Hybrid Linear Diffusion Transformer architecture.
Combines Gated DeltaNet and softmax attention for memory-efficient long-context modeling.
Supports precise 6-DoF camera control through a dual-branch camera-control design.
Applies a two-stage generation pipeline with long-video refinement.
Trains from public video clips with metric-scale camera-pose supervision.
Targets interactive world modeling, embodied AI, and camera-controlled video generation.
Provides public paper, code, and model resources for research use.
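
To make the "6-DoF camera trajectory" input concrete, here is a minimal sketch of how such a trajectory is commonly represented: three translations plus three rotations per frame, packed into a 4x4 pose matrix. The function name, the axis convention, and the sample trajectory are assumptions for illustration only; the input format SANA-WM actually expects is not described here.

```python
import numpy as np

def pose_from_6dof(tx, ty, tz, yaw, pitch, roll):
    """Build a 4x4 pose matrix from 3 translations (metres) and 3 rotations (radians)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    # Assumed convention: yaw about Y, pitch about X, roll about Z, composed as Ry @ Rx @ Rz.
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    Rz = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    pose = np.eye(4)
    pose[:3, :3] = Ry @ Rx @ Rz      # rotation block (3 degrees of freedom)
    pose[:3, 3] = [tx, ty, tz]       # translation column (3 degrees of freedom)
    return pose

# Hypothetical trajectory: advance 0.1 m along +Z and add 2 degrees of yaw per frame.
trajectory = np.stack([
    pose_from_6dof(0.0, 0.0, 0.1 * i, np.radians(2.0 * i), 0.0, 0.0)
    for i in range(5)
])
```

The result is one pose per frame, stacked into an array of shape (frames, 4, 4). Metric-scale supervision means the translation values carry real-world units rather than an arbitrary scale.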

SANA-WM is a 2.6B-parameter open-source world model built on a Hybrid Linear Diffusion Transformer. It combines frame-wise Gated DeltaNet with softmax attention for long-context modeling, uses dual-branch camera control for 6-DoF trajectory adherence, and applies a two-stage pipeline with a long-video refiner. These design choices help SANA-WM maintain temporal consistency and visual quality over longer sequences than typical short-form video generators.
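
The memory efficiency of a Gated DeltaNet layer comes from keeping a fixed-size state instead of a cache that grows with sequence length. The sketch below shows the generic gated delta rule recurrence in NumPy. It is not SANA-WM's implementation: the function name, dimensions, and gate values are assumptions, and how the model applies the rule frame-wise or interleaves it with softmax attention is not covered.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule.

    S: (d_v, d_k) memory state; q, k: (d_k,); v: (d_v,); alpha, beta: scalars in [0, 1].
    """
    S = alpha * S                          # gate: decay the old memory
    v_old = S @ k                          # value currently stored under key k
    S = S + beta * np.outer(v - v_old, k)  # delta rule: move the stored value toward v
    return S, S @ q                        # new state and the output read with query q

# Illustrative run with assumed sizes; keys are unit-normalised.
rng = np.random.default_rng(0)
d_k, d_v, T = 8, 4, 100
S = np.zeros((d_v, d_k))
for _ in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)
    q = rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    S, out = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)
```

After 100 steps the state is still a single d_v x d_k matrix, whereas softmax attention would need to keep keys and values for every earlier position. A hybrid design retains some softmax attention layers alongside the recurrent ones.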


SANA-WM is valuable for researchers and developers building explorable AI worlds, robotics simulators, camera-controlled video tools, or data engines for embodied agents. It is notable for its efficient training and inference profile and for training on public video data with metric-scale pose supervision rather than depending only on massive closed datasets. The release provides paper, code, and model links, so it is listed as a free open-source world-model project.
