At the core of Yume 1.5 is a long-video generation framework based on joint temporal-spatial-channel modeling, which uses unified context compression with linear attention to maintain visual quality over long sequences without memory or compute growing unboundedly with video length. Because linear attention admits a recurrent formulation with a fixed-size state, per-frame cost stays roughly constant no matter how long the history grows. This design lets the model handle extended durations while preserving consistency across time, space, and feature channels, yielding worlds that feel visually stable and continuous. On top of this foundation, Yume 1.5 adds a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; together these speed up inference and mitigate error accumulation as the video and interaction history grow.
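To make the cost argument concrete, here is a minimal sketch of causal linear attention computed recurrently: a running state `kv_state` and normalizer `k_sum` summarize all past frame tokens in fixed memory, so the per-step cost does not grow with history length. The feature map, tensor shapes, and names are illustrative assumptions, not drawn from the Yume 1.5 codebase.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1: a common positive feature map used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_step(q_t, k_t, v_t, kv_state, k_sum):
    """One recurrent step of causal linear attention.

    q_t, k_t : (d,)   query/key for the current token
    v_t      : (d_v,) value for the current token
    kv_state : (d, d_v) running sum of phi(k) v^T over all past tokens
    k_sum    : (d,)     running sum of phi(k) over all past tokens
    """
    phi_q, phi_k = feature_map(q_t), feature_map(k_t)
    kv_state = kv_state + np.outer(phi_k, v_t)        # fixed-size memory of the past
    k_sum = k_sum + phi_k
    out = phi_q @ kv_state / (phi_q @ k_sum + 1e-6)   # normalized attention output
    return out, kv_state, k_sum

# Toy run: stream 1000 "frame tokens"; the state never grows.
d, d_v = 16, 32
rng = np.random.default_rng(0)
kv_state, k_sum = np.zeros((d, d_v)), np.zeros(d)
for _ in range(1000):
    q, k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d_v)
    out, kv_state, k_sum = linear_attention_step(q, k, v, kv_state, k_sum)
print(out.shape, kv_state.shape)  # (32,) (16, 32) — constant regardless of length
```

Contrast this with standard softmax attention, whose per-step cost and memory scale with the number of past tokens; that difference is what keeps long-horizon generation tractable.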
Yume 1.5 also emphasizes interactivity and controllability through keyboard-based exploration and text-controlled event generation. Users can move through the generated world with familiar WASD-style controls, enabling intuitive camera navigation across large spaces without breaking temporal continuity. In addition, the system decomposes textual descriptions into event and action components, giving precise control over dynamic world events and behaviors: prompts can specify not just what the world looks like, but how it evolves and what happens within it over time.
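The keyboard interface can be pictured as translating keypresses into camera-motion conditioning signals fed to the generator at each step. The sketch below shows one plausible shape for that mapping; the key bindings, `CameraAction` fields, and the commented `generate_next_frame` call are hypothetical stand-ins, not Yume 1.5's actual API.

```python
from dataclasses import dataclass

@dataclass
class CameraAction:
    forward: float = 0.0   # translation per step, positive = forward
    strafe: float = 0.0    # positive = right
    yaw: float = 0.0       # radians, positive = turn right

# WASD-style bindings -> camera motion deltas (values are illustrative).
KEY_BINDINGS = {
    "w": CameraAction(forward=+1.0),
    "s": CameraAction(forward=-1.0),
    "a": CameraAction(strafe=-1.0),
    "d": CameraAction(strafe=+1.0),
    "q": CameraAction(yaw=-0.1),
    "e": CameraAction(yaw=+0.1),
}

def keys_to_action(pressed_keys):
    """Combine simultaneously pressed keys into one conditioning signal."""
    action = CameraAction()
    for key in pressed_keys:
        delta = KEY_BINDINGS.get(key)
        if delta:
            action.forward += delta.forward
            action.strafe += delta.strafe
            action.yaw += delta.yaw
    return action

# Hypothetical interactive loop: each step conditions the next generated
# frame on the current camera action, so navigation continues the stream
# rather than restarting it.
# frame = model.generate_next_frame(frame, keys_to_action({"w", "d"}))
```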
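The event/action decomposition can likewise be pictured as splitting one prompt into two conditioning streams: what happens in the world (the event) and how the viewer or agent moves (the action). The keyword heuristic and field names below are assumptions chosen only to show the shape of the interface; the system's actual decomposition is presumably model-driven rather than rule-based.

```python
from dataclasses import dataclass

@dataclass
class StructuredPrompt:
    event: str   # what happens in the world
    action: str  # how the camera/agent behaves

def decompose_prompt(text: str) -> StructuredPrompt:
    """Naive illustrative split: treat the clause after 'while' as the action.

    A real system would decompose prompts semantically; this heuristic only
    demonstrates the two-stream event/action interface.
    """
    if " while " in text:
        event, action = text.split(" while ", 1)
        return StructuredPrompt(event=event.strip(), action=action.strip())
    return StructuredPrompt(event=text.strip(), action="")

p = decompose_prompt("A storm rolls in over the bay while walking forward along the pier")
print(p.event)   # -> "A storm rolls in over the bay"
print(p.action)  # -> "walking forward along the pier"
# Each component would then be embedded separately and injected as its own
# conditioning signal, so world events and camera motion can be controlled
# independently over time.
```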


