Yume's technical framework includes several key components. Camera motion quantization translates continuous camera trajectories into intuitive directional and rotational actions mapped to keyboard input. The Masked Video Diffusion Transformer (MVDT) with frame memory enables infinite autoregressive generation while maintaining consistency across long sequences. Additionally, Yume uses a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) to enhance visual quality and controllability.
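As a rough illustration of the camera motion quantization idea, the sketch below discretizes the motion between consecutive camera poses into keyboard-style actions. It is a minimal, hypothetical example: the key bindings, thresholds, function names, and pose representation are assumptions for illustration, not Yume's actual implementation.

```python
# Hypothetical sketch of camera motion quantization.
# Key bindings, thresholds, and pose format are illustrative assumptions.
from dataclasses import dataclass

import numpy as np

# Assumed mapping from discrete camera actions to keyboard keys.
KEY_MAP = {
    "forward": "W", "backward": "S",
    "left": "A", "right": "D",
    "rotate_left": "Q", "rotate_right": "E",
}


@dataclass
class CameraPose:
    position: np.ndarray  # (x, y, z) world coordinates
    yaw: float            # heading angle in radians


def quantize_step(prev: CameraPose, curr: CameraPose,
                  move_eps: float = 0.05, rot_eps: float = 0.02) -> list[str]:
    """Map the motion between two consecutive poses to discrete key actions."""
    keys = []

    # Express the translation in the previous frame's local coordinates.
    delta = curr.position - prev.position
    cos_y, sin_y = np.cos(-prev.yaw), np.sin(-prev.yaw)
    lateral = cos_y * delta[0] - sin_y * delta[2]
    forward = sin_y * delta[0] + cos_y * delta[2]

    if forward > move_eps:
        keys.append(KEY_MAP["forward"])
    elif forward < -move_eps:
        keys.append(KEY_MAP["backward"])
    if lateral > move_eps:
        keys.append(KEY_MAP["right"])
    elif lateral < -move_eps:
        keys.append(KEY_MAP["left"])

    # Quantize rotation about the vertical axis.
    dyaw = curr.yaw - prev.yaw
    if dyaw > rot_eps:
        keys.append(KEY_MAP["rotate_left"])
    elif dyaw < -rot_eps:
        keys.append(KEY_MAP["rotate_right"])

    return keys


def quantize_trajectory(poses: list[CameraPose]) -> list[list[str]]:
    """Convert a continuous camera trajectory into per-frame key actions."""
    return [quantize_step(a, b) for a, b in zip(poses, poses[1:])]
```

In this sketch, each consecutive pair of poses yields a (possibly empty) set of pressed keys, which is the kind of discrete, keyboard-like control signal the quantization step is meant to produce.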
Yume is trained on the high-quality world exploration dataset Sekai and achieves remarkable results across diverse scenes and applications. Its resources, including the data, codebase, and model weights, are available on GitHub. Yume will be updated monthly in pursuit of its original goal of creating interactive, realistic, and dynamic worlds from various inputs. Potential applications include image and video editing, virtual reality, and more.