SANA-WM is a 2.6B-parameter open-source world model built on a Hybrid Linear Diffusion Transformer. It combines frame-wise Gated DeltaNet with softmax attention for long-context modeling, uses dual-branch camera control for 6-DoF trajectory adherence, and applies a two-stage pipeline with a long-video refiner. These design choices help it maintain temporal consistency and visual quality over longer sequences than typical short-form video generators.
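To give a rough sense of why a Gated DeltaNet layer suits long contexts, the sketch below implements the gated delta rule in its generally published recurrent form: a fixed-size memory matrix is decayed by a gate, the value stored under the current key is erased, and the new value is written. This is an illustration in plain Python, not SANA-WM's code. The function name `gated_delta_rule`, the argument layout, and the token-by-token loop are assumptions made for the example, and how SANA-WM applies the rule frame-wise or interleaves it with softmax attention is not described here.

```python
from math import sqrt


def gated_delta_rule(queries, keys, values, alphas, betas):
    """Recurrent gated delta rule (illustrative, hypothetical helper).

    queries, keys: lists of length-d_k vectors, one per step.
    values:        list of length-d_v vectors, one per step.
    alphas:        per-step decay gates in [0, 1] (1 keeps memory, 0 clears it).
    betas:         per-step write strengths in [0, 1].
    Returns one length-d_v output vector per step.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # Memory S has shape d_v x d_k and never grows with sequence length.
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, a, b in zip(queries, keys, values, alphas, betas):
        # Normalise the key so the erase step removes exactly what is stored.
        norm = sqrt(sum(x * x for x in k)) or 1.0
        k = [x / norm for x in k]
        for i in range(d_v):
            row = S[i]
            recalled = sum(row[j] * k[j] for j in range(d_k))  # (S k)_i
            # S <- a * (S - b * (S k) k^T) + b * v k^T
            S[i] = [
                a * (row[j] - b * recalled * k[j]) + b * v[i] * k[j]
                for j in range(d_k)
            ]
        # Read out with the query: o = S q
        outputs.append(
            [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
        )
    return outputs


if __name__ == "__main__":
    # Writing twice under the same key overwrites the first value.
    out = gated_delta_rule(
        queries=[[1.0, 0.0], [1.0, 0.0]],
        keys=[[1.0, 0.0], [1.0, 0.0]],
        values=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        alphas=[1.0, 1.0],
        betas=[1.0, 1.0],
    )
    print(out)
```

Because the memory stays at `d_v x d_k` regardless of how many steps have been processed, cost per step is constant, whereas softmax attention compares each step against all earlier ones. That trade-off is the usual motivation for hybrid designs that mix the two.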
SANA-WM is valuable for researchers and developers building explorable AI worlds, robotics simulators, camera-controlled video tools, or data engines for embodied agents. It is notable for its efficient training and inference profile and for using public video data with metric-scale pose supervision rather than depending only on massive closed datasets. The release provides paper, code, and model links, so it is listed as a free open-source world-model project.


