HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang
2026-04-17
Summary
This paper introduces HY-World 2.0, a new system that can create detailed 3D virtual worlds from different kinds of inputs like text descriptions, single pictures, multiple pictures, or videos.
What's the problem?
Creating realistic and interactive 3D worlds from simple inputs is really hard. Existing methods often struggle with generating high-quality visuals, understanding the scene well enough to allow movement within it, and maintaining consistency when building the world from different viewpoints. Essentially, it's difficult to get a computer to 'imagine' a full 3D environment based on limited information.
What's the solution?
The researchers built HY-World 2.0, which works in four main steps. First, it creates a panoramic view of the scene. Then, it plans a path for someone to move through the world. Next, it expands the world by adding more views from different angles, making sure everything stays consistent. Finally, it combines all these pieces to create a complete 3D world. They also developed a new rendering platform called WorldLens to display these worlds interactively, even with characters in them, and improved the underlying models to make everything more realistic and efficient.
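The four steps above can be sketched as a simple data-flow pipeline. This is a hypothetical illustration only: every function and class name below (`generate_panorama`, `plan_trajectory`, `expand_views`, `compose_world`, `Scene3DGS`) is a stand-in for the corresponding stage in the paper, not the actual HY-World 2.0 API, and each stage is stubbed to show how outputs feed the next stage.

```python
from dataclasses import dataclass

@dataclass
class Scene3DGS:
    """Stand-in for a navigable 3D Gaussian Splatting scene."""
    gaussians: list    # placeholder for splat parameters (one entry per fused view)
    trajectory: list   # camera path the scene was built along

def generate_panorama(prompt: str) -> str:
    # Stage 1: Panorama Generation (HY-Pano 2.0 in the paper)
    return f"pano({prompt})"

def plan_trajectory(panorama: str, n_steps: int = 4) -> list:
    # Stage 2: Trajectory Planning (WorldNav) — a camera path through the scene
    return [f"pose_{i}" for i in range(n_steps)]

def expand_views(panorama: str, trajectory: list) -> list:
    # Stage 3: World Expansion (WorldStereo 2.0) — one consistent keyframe per pose
    return [f"view({panorama}, {pose})" for pose in trajectory]

def compose_world(keyframes: list, trajectory: list) -> Scene3DGS:
    # Stage 4: World Composition (WorldMirror 2.0) — fuse all views into one 3DGS scene
    return Scene3DGS(gaussians=keyframes, trajectory=trajectory)

def generate_world(prompt: str) -> Scene3DGS:
    """Run the full four-stage pipeline from a text prompt to a 3DGS scene."""
    pano = generate_panorama(prompt)
    traj = plan_trajectory(pano)
    views = expand_views(pano, traj)
    return compose_world(views, traj)
```

The point of the sketch is the ordering: the panorama anchors the scene, the planned trajectory decides which new views are needed, and composition only happens once all keyframes are available.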
Why it matters?
This work is important because it pushes the boundaries of what's possible in creating 3D worlds automatically. It achieves results comparable to some of the best closed-source systems currently available. By releasing all the code and models, the researchers are helping other scientists build upon this work, potentially leading to advances in areas like virtual reality, game development, and robotics, where realistic 3D environments are crucial.
Abstract
We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining its model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. In addition, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance among open-source approaches on several benchmarks, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
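The abstract describes two distinct paths depending on the input modality: text or a single image triggers the four-stage generation pipeline, while multi-view images or video go to feed-forward reconstruction (the WorldMirror 2.0 path). A minimal sketch of that routing, with hypothetical modality labels and return values not taken from the paper, might look like:

```python
# Hypothetical modality dispatcher; labels and paths are illustrative only.
GENERATIVE_INPUTS = {"text", "single_image"}        # → four-stage generation
RECONSTRUCTIVE_INPUTS = {"multi_view", "video"}     # → feed-forward reconstruction

def route_modality(modality: str) -> str:
    """Decide which HY-World 2.0 path handles a given input modality."""
    if modality in GENERATIVE_INPUTS:
        return "generation"       # Panorama → Trajectory → Expansion → Composition
    if modality in RECONSTRUCTIVE_INPUTS:
        return "reconstruction"   # direct WorldMirror-style 3D prediction
    raise ValueError(f"unsupported modality: {modality}")
```

The design choice worth noting is that both paths converge on the same 3DGS representation, which is what lets a single rendering platform (WorldLens) serve generated and reconstructed worlds alike.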