Grounding World Simulation Models in a Real-World Metropolis

Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim

2026-03-17

Summary

This paper introduces a new type of computer model called Seoul World Model (SWM) that can create realistic videos of a real city – Seoul, South Korea – as if you were driving or walking through it. Unlike other models that invent environments, SWM is based on actual street-level images.

What's the problem?

Building this model wasn't easy. The biggest challenges were making sure the generated videos looked consistent over time, creating a variety of possible routes through the city, and dealing with the fact that the original images used to build the model weren't taken often enough or from enough angles. Basically, it's hard to make a smooth, realistic video from a limited set of real-world snapshots.

What's the solution?

The researchers tackled these problems in a few key ways. First, they paired images of the same place taken at different times (the paper calls this "cross-temporal pairing") to create a more continuous view, and built a view-interpolation pipeline that turns sparse street-view snapshots into coherent training videos. Second, they generated a large synthetic dataset covering diverse camera paths and viewpoints. Finally, they introduced a technique called the Virtual Lookahead Sink: as the video is generated chunk by chunk, the model is repeatedly re-anchored to a real street-view image retrieved at a point just ahead on the route, so the output stays grounded in reality instead of drifting into unrealistic territory.
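To make the re-grounding idea concrete, here is a minimal sketch of chunk-wise generation with lookahead retrieval. This is not the authors' code: every function and data structure below is a hypothetical stand-in (the "model" is just a blend of numbers), and it only illustrates the control flow of retrieving a reference at a future location and conditioning each chunk on it.

```python
def retrieve_nearby_view(database, position):
    """Return the street-view entry whose location is closest to `position`
    (a stand-in for the paper's retrieval over real street-view images)."""
    return min(database, key=lambda entry: abs(entry["pos"] - position))

def generate_chunk(prev_frame, reference):
    """Stand-in for the video model: blend the previous frame toward the
    retrieved reference so the output stays anchored to real imagery."""
    return 0.5 * prev_frame + 0.5 * reference["pixels"]

def simulate_trajectory(database, start_frame, positions, lookahead=1):
    """Generate one frame per trajectory position, re-grounding each chunk
    on an image retrieved at a *future* point along the route."""
    frames, frame = [], start_frame
    for i, pos in enumerate(positions):
        # Look ahead along the trajectory so the model is pulled toward
        # where it is going, not where it has already been.
        future_pos = positions[min(i + lookahead, len(positions) - 1)]
        reference = retrieve_nearby_view(database, future_pos)
        frame = generate_chunk(frame, reference)
        frames.append(frame)
    return frames

# Toy run: "pixels" are single numbers standing in for whole images,
# and positions are 1-D distances along a route in meters.
db = [{"pos": p, "pixels": float(p)} for p in range(0, 100, 10)]
video = simulate_trajectory(db, start_frame=0.0, positions=[0, 5, 12, 20])
```

The point of the sketch is the loop structure: without the retrieval step, errors in each generated chunk would compound over hundreds of meters; pulling each chunk toward a real image just ahead keeps the rollout tethered to the actual city.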

Why it matters?

This work is important because it's a big step towards creating truly realistic simulations of the real world. This could have huge implications for things like training self-driving cars, developing better navigation apps, or even creating immersive virtual reality experiences. It shows we're getting closer to being able to digitally recreate and interact with actual places.

Abstract

What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.