Key Features

Interactive world generation from images, text, or videos
Camera motion quantization for stable training and user-friendly interaction
Masked Video Diffusion Transformer (MVDT) for infinite autoregressive generation
Training-free Anti-Artifact Mechanism (AAM) for enhanced visual quality
Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) for precise control
Synergistic optimization of adversarial distillation and caching mechanisms for model acceleration
High-fidelity and interactive video world generation
Trained on the high-quality world exploration dataset Sekai

Yume's technical framework includes several key components. Camera motion quantization translates continuous camera trajectories into intuitive directional and rotational actions mapped to keyboard input. The Masked Video Diffusion Transformer (MVDT) with frame memory enables infinite autoregressive generation, maintaining consistency across long sequences. In addition, Yume applies a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) to enhance visual quality and enable more precise control. Illustrative sketches of these ideas follow.
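To illustrate the camera quantization idea, the minimal sketch below bins continuous per-frame camera deltas into discrete keyboard-style actions. The thresholds, key names, and `CameraDelta` fields are assumptions for this example, not Yume's actual values.

```python
# Hypothetical sketch of camera motion quantization: continuous per-frame
# camera deltas are binned into discrete keyboard-style actions.
# Thresholds and key names are illustrative assumptions, not Yume's values.

from dataclasses import dataclass

@dataclass
class CameraDelta:
    dx: float   # lateral translation
    dz: float   # forward translation
    yaw: float  # rotation about the vertical axis (radians)

TRANS_THRESHOLD = 0.05  # assumed dead zone for translation
ROT_THRESHOLD = 0.02    # assumed dead zone for rotation

def quantize(delta: CameraDelta) -> list[str]:
    """Map a continuous camera delta to discrete key actions."""
    keys = []
    if delta.dz > TRANS_THRESHOLD:
        keys.append("W")       # move forward
    elif delta.dz < -TRANS_THRESHOLD:
        keys.append("S")       # move backward
    if delta.dx > TRANS_THRESHOLD:
        keys.append("D")       # strafe right
    elif delta.dx < -TRANS_THRESHOLD:
        keys.append("A")       # strafe left
    if delta.yaw > ROT_THRESHOLD:
        keys.append("RIGHT")   # turn right
    elif delta.yaw < -ROT_THRESHOLD:
        keys.append("LEFT")    # turn left
    return keys

print(quantize(CameraDelta(dx=0.0, dz=0.12, yaw=-0.04)))  # ['W', 'LEFT']
```

Quantizing motion this way keeps the action space small and discrete, which is what makes keyboard-driven interaction and stable conditioning practical.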

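The autoregressive rollout with frame memory can be pictured as follows: a sliding window of recent frames conditions the generation of each new chunk, so the video can be extended indefinitely. The model handle `mvdt`, its `denoise_chunk` method, and the window sizes are hypothetical placeholders, not Yume's API.

```python
# Minimal sketch of autoregressive rollout with frame memory, assuming a
# masked video diffusion model `mvdt` that denoises a chunk of future
# frames conditioned on a memory window of past frames and a key action.
# All names (mvdt, denoise_chunk, MEMORY_LEN, CHUNK_LEN) are hypothetical.

import torch

MEMORY_LEN = 16  # assumed number of past frames kept as conditioning
CHUNK_LEN = 8    # assumed number of frames generated per step

def rollout(mvdt, first_frame: torch.Tensor, actions: list[str]) -> torch.Tensor:
    frames = first_frame.unsqueeze(0)  # (T=1, C, H, W)
    for action in actions:
        memory = frames[-MEMORY_LEN:]                      # sliding frame memory
        noise = torch.randn(CHUNK_LEN, *first_frame.shape)
        chunk = mvdt.denoise_chunk(noise, memory, action)  # placeholder call
        frames = torch.cat([frames, chunk], dim=0)         # extend the video
    return frames
```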

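TTS-SDE's core idea of revisiting earlier denoising steps can be sketched roughly as below: after advancing the reverse SDE, the sampler occasionally re-injects noise and redoes a few steps, so later context can refine earlier content. The `score_step` callback and the noise schedule here are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of time-travel sampling on top of an SDE sampler:
# after denoising to step t, occasionally "travel back" by re-injecting
# noise and re-denoising the same interval. `score_step` and the
# schedule are assumptions for this example.

import torch

def noise_scale(t_lo, t_hi):
    # assumed monotone noise schedule; a real sampler would use its own sigmas
    return 0.1 * (t_hi - t_lo)

def tts_sde_sample(score_step, x_T, num_steps=50, travel_every=10, travel_len=3):
    x = x_T
    t = num_steps
    while t > 0:
        x = score_step(x, t)          # one reverse-SDE denoising step
        t -= 1
        if t > 0 and t % travel_every == 0:
            # time travel: re-noise back `travel_len` steps, then redo them
            back = min(travel_len, num_steps - t)
            x = x + torch.randn_like(x) * noise_scale(t, t + back)
            for s in range(t + back, t, -1):
                x = score_step(x, s)
    return x
```
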
Yume is trained on the high-quality world exploration dataset Sekai and achieves remarkable results across diverse scenes and applications. Its resources, including the data, codebase, and model weights, are available on GitHub. The project is updated monthly as it works toward its original goal of creating interactive, realistic, and dynamic worlds from image, text, or video input. Potential applications include image and video editing, virtual reality, and more.
