Key Features

Generates realistic, interactive, and continuous worlds from a single image or free-form text prompt.
Supports keyboard-based exploration with intuitive WASD controls for navigating the generated video world.
Long-video generation framework built on joint temporal-spatial-channel modeling for coherent, extended sequences.
Unified context compression with linear attention to maintain video quality while controlling memory and computation costs.
Real-time streaming acceleration strategy powered by bidirectional attention distillation to enable fast, responsive inference.
Enhanced text embedding scheme that stabilizes long-horizon generation and reduces error accumulation in extended interactions.
Text-controlled event generation that decomposes captions into event and action descriptions for precise dynamic control.
Designed to address the limitations of earlier interactive world models, including large parameter counts, slow multi-step inference, and unmanageable historical context.

At the core of Yume 1.5 is a long-video generation framework based on joint temporal-spatial-channel modeling, which uses unified context compression with linear attention to maintain visual quality over long sequences without letting memory or computation costs grow unboundedly. This design allows the model to handle extended durations while preserving consistency across time, space, and feature channels, resulting in worlds that feel visually stable and continuous. On top of this foundation, Yume 1.5 adds a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme, which together speed up inference and mitigate error accumulation as the video and interaction history grow.
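
For readers unfamiliar with why linear attention matters here, the sketch below illustrates the general idea under stated assumptions. It is not Yume 1.5's actual implementation, and every name in it (StreamingLinearAttention, the elu+1 feature map, the token shapes) is chosen purely for illustration. The point is that linear attention lets an arbitrarily long frame history be summarized by fixed-size running statistics, which is what keeps memory and compute from growing with video length.

```python
# Minimal sketch (not Yume 1.5's released code) of linear-attention context
# compression: the entire frame history is folded into fixed-size running
# statistics, so per-frame memory and compute stay flat no matter how long
# the video gets. Class and parameter names are assumptions for this example.
import torch
import torch.nn.functional as F


class StreamingLinearAttention(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        # Compressed history: a (dim x dim) matrix plus a (dim,) normalizer.
        # Neither grows with the number of frames already generated.
        self.register_buffer("state", torch.zeros(dim, dim))
        self.register_buffer("norm", torch.zeros(dim))

    @staticmethod
    def feature_map(x: torch.Tensor) -> torch.Tensor:
        # Positive feature map (elu + 1), a standard linear-attention choice.
        return F.elu(x) + 1.0

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (num_tokens, dim) for the newest frame only.
        q = self.feature_map(self.to_q(frame_tokens))
        k = self.feature_map(self.to_k(frame_tokens))
        v = self.to_v(frame_tokens)
        # Fold the new frame into the running summaries: state += k^T v.
        self.state = self.state + k.transpose(0, 1) @ v
        self.norm = self.norm + k.sum(dim=0)
        # Attend to the compressed history instead of every past token.
        out = (q @ self.state) / (q @ self.norm).clamp_min(1e-6).unsqueeze(-1)
        return out


if __name__ == "__main__":
    attn = StreamingLinearAttention(dim=64)
    with torch.no_grad():
        for _ in range(1000):              # arbitrarily long frame stream
            tokens = torch.randn(256, 64)  # latent tokens for one new frame
            out = attn(tokens)
    print(out.shape, attn.state.shape)     # state size never changed
```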


Yume 1.5 also emphasizes interactivity and controllability through its support for keyboard-based exploration and text-controlled event generation. Users can move through the generated world using familiar WASD-style controls, enabling intuitive camera navigation and exploration of large, generated spaces without breaking temporal continuity. In addition, the system decomposes textual descriptions into event and action components, allowing precise control over dynamic world events and behaviors, so that prompts can specify not just what the world looks like, but also how it evolves and what happens within it over time.
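
To make the control flow concrete, here is an illustrative sketch of how WASD key presses could be turned into a per-frame motion conditioning signal, and how a caption could be split into event and action parts. None of these names (CameraAction, keys_to_action, decompose_caption, the ";" delimiter) come from Yume 1.5; the real system presumably uses learned conditioning and a far more capable caption parser.

```python
# Illustrative only: the mapping and parsing below are assumptions standing in
# for Yume 1.5's actual keyboard interface and caption-decomposition pipeline.
from dataclasses import dataclass


@dataclass
class CameraAction:
    forward: float = 0.0  # translation along the view direction
    strafe: float = 0.0   # lateral translation

# Hypothetical key-to-motion bindings in the spirit of WASD exploration.
KEY_BINDINGS = {
    "w": CameraAction(forward=+1.0),
    "s": CameraAction(forward=-1.0),
    "a": CameraAction(strafe=-1.0),
    "d": CameraAction(strafe=+1.0),
}


def keys_to_action(pressed_keys: set[str]) -> CameraAction:
    """Combine currently pressed keys into one conditioning signal per frame."""
    action = CameraAction()
    for key in pressed_keys:
        binding = KEY_BINDINGS.get(key)
        if binding:
            action.forward += binding.forward
            action.strafe += binding.strafe
    return action


def decompose_caption(caption: str) -> dict[str, str]:
    """Toy stand-in for splitting a prompt into event and action descriptions.

    A real system would use a learned parser; splitting on a marker here just
    makes the two conditioning channels explicit.
    """
    event, _, action = caption.partition(";")
    return {"event": event.strip(), "action": action.strip()}


if __name__ == "__main__":
    print(keys_to_action({"w", "d"}))
    print(decompose_caption("a storm rolls in over the city; walk toward the harbor"))
```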
