
Lyra 2.0: Explorable Generative 3D Worlds

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, Xuanchi Ren

2026-04-15


Summary

This paper introduces a new system, Lyra 2.0, for creating large, detailed 3D worlds: it first generates camera-controlled videos that simulate walking through the world, then lifts those videos into actual 3D models.

What's the problem?

Creating the long, consistent walkthrough videos these 3D worlds require is really hard. Current video generation models struggle with two main issues: 'spatial forgetting,' where the model forgets what earlier parts of the world look like when it revisits them, and 'temporal drifting,' where small errors in each frame build up over time until the scene becomes distorted and inaccurate. Imagine trying to draw a long road: it's easy to let it curve or change width without realizing it as you go.

What's the solution?

Lyra 2.0 tackles these problems in two ways. First, it keeps a running record of the world's 3D geometry, frame by frame, and uses that geometry only to decide which past frames are relevant to the new viewpoint, not to *draw* those areas itself; the video generation model stays in charge of making things look good. Second, it trains the model to recognize and correct its own mistakes by showing it examples of videos that already contain those errors, teaching it to fix the 'drifting' effect rather than propagate it. Together, these changes allow much longer and more accurate video generation.
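The first idea, using geometry only for information routing, can be illustrated with a minimal sketch: keep a point cloud per past frame, project each one into the target camera, and retrieve the frames whose geometry is most visible from the new viewpoint. All names here (`project_points`, `retrieve_memory_frames`) are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def project_points(points, K, R, t, width, height):
    """Project world-space 3D points into a camera (intrinsics K, rotation R,
    translation t) and return a boolean mask of points visible in the image."""
    cam = points @ R.T + t                      # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                 # must lie in front of the camera
    uv = cam @ K.T
    uv = uv[:, :2] / np.maximum(uv[:, 2:3], 1e-6)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < width) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return in_front & inside

def retrieve_memory_frames(frame_points, K, target_R, target_t,
                           width, height, top_k=2):
    """Score each past frame by the fraction of its stored geometry that the
    target viewpoint can see, and return the indices of the top_k frames.
    Only the *selection* uses geometry; appearance is left to the generator."""
    scores = [project_points(pts, K, target_R, target_t, width, height).mean()
              for pts in frame_points]
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:top_k]], scores
```

In this toy version, a frame whose points fall behind the target camera scores zero and is never retrieved, while a frame covering the revisited region scores highly, which is the routing behavior the paper relies on to recall past areas.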

Why it matters?

This is important because it allows for the creation of much larger and more complex virtual worlds. These worlds could be used for things like realistic game environments, detailed simulations, or even virtual reality experiences, all built more efficiently than previous methods.

Abstract

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
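The self-augmented history idea from the abstract can be sketched as a data-construction step: instead of conditioning training on clean ground-truth history, corrupt the history with accumulating noise (a crude stand-in for the model's own degraded rollouts) while keeping the supervision target clean, so the model learns to correct drift rather than propagate it. The function names and the additive-noise degradation model are assumptions for illustration, not the paper's actual augmentation.

```python
import numpy as np

def self_augment_history(clean_frames, drift_std=0.05, rng=None):
    """Simulate autoregressive drift: each history frame inherits the
    accumulated error of the previous frame plus fresh noise, mimicking
    how small synthesis errors compound over a rollout."""
    rng = rng or np.random.default_rng(0)
    degraded, err = [], np.zeros_like(clean_frames[0])
    for f in clean_frames:
        err = err + rng.normal(0.0, drift_std, f.shape)  # error accumulates
        degraded.append(f + err)
    return degraded

def make_training_pair(clean_frames):
    """Build one training example: the model conditions on the degraded
    history but is supervised against the clean final frame, so the loss
    rewards correcting drift instead of copying it forward."""
    history = self_augment_history(clean_frames[:-1])
    target = clean_frames[-1]          # clean ground-truth supervision
    return history, target
```

The key design choice this illustrates is the train/test mismatch fix: at inference the model only ever sees its own imperfect outputs as context, so training must expose it to comparably degraded histories.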