Map2World: Segment Map Conditioned Text to 3D World Generation

Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang, Jiaolong Yang, Kyoung Mu Lee

2026-05-04

Summary

This paper introduces a new method called Map2World for creating detailed and consistent 3D worlds, like those needed for video games or simulations.

What's the problem?

Currently, creating large 3D worlds is difficult because existing methods often rely on rigid, grid-like layouts, which can look unnatural and cause objects to appear at inconsistent scales across the world. It's hard to make a world that feels both large and realistically scaled.

What's the solution?

Map2World solves this by letting users define the basic layout of the world using custom segment maps of any shape. It then builds the 3D world from this map, ensuring everything is scaled consistently across the entire environment. To add detail, the authors also created a 'detail enhancer' network that adds fine features without disrupting the overall structure, and the pipeline is designed to generalize well even when only limited training examples are available.
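To make the idea of segment-map conditioning concrete, here is a minimal sketch (not the paper's actual code or data structures; all names such as `place_regions`, `LABELS`, and `METERS_PER_CELL` are hypothetical) of how a user-drawn map of arbitrary footprint can drive a layout while one global scale factor keeps object sizes consistent everywhere:

```python
# Conceptual sketch only: illustrates segment-map-conditioned layout with a
# single global scale, not Map2World's actual implementation.

METERS_PER_CELL = 4.0  # one shared scale factor -> consistent sizes world-wide

# A segment map: each cell holds a semantic label; 0 means "outside the map",
# so the overall footprint need not be rectangular (arbitrary shape).
segment_map = [
    [0, 1, 1, 0],
    [1, 1, 2, 2],
    [0, 3, 2, 0],
]
LABELS = {1: "forest", 2: "road", 3: "lake"}  # hypothetical label set

def place_regions(seg_map):
    """Map each labeled cell to a world-space placement at the global scale."""
    placements = []
    for row_idx, row in enumerate(seg_map):
        for col_idx, label in enumerate(row):
            if label == 0:
                continue  # skip cells outside the user-drawn footprint
            placements.append({
                "category": LABELS[label],
                # World coordinates derive from one shared scale factor,
                # so relative sizes stay consistent across the environment.
                "x": col_idx * METERS_PER_CELL,
                "y": row_idx * METERS_PER_CELL,
                "size": METERS_PER_CELL,
            })
    return placements

placements = place_regions(segment_map)
```

In this toy version, every placement inherits the same `METERS_PER_CELL`, which is the scale-consistency property the paper emphasizes; a real system would then hand each region to an asset generator and refine it with the detail enhancer.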

Why it matters?

This research is important because it allows for much more control and realism when generating 3D worlds. It means creators can build more complex and believable environments more easily, which is useful for things like designing realistic simulations for self-driving cars or creating immersive video game experiences.

Abstract

3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments. To further enhance the quality, we propose a detail enhancer network that generates fine details of the world. The detail enhancer enables the addition of fine-grained details without compromising overall scene coherence by incorporating global structure information. We design the entire pipeline to leverage strong priors from asset generators, achieving robust generalization across diverse domains, even under limited training data for scene generation. Extensive experiments demonstrate that our method significantly outperforms existing approaches in user-controllability, scale consistency, and content coherence, enabling users to generate 3D worlds under more complex conditions.