CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler
2026-04-22
Summary
This paper introduces CityRAG, a system for generating realistic, long, spatially consistent videos of real-world locations, like a digital twin you can explore.
What's the problem?
Current video generation models can produce videos from text or image prompts, but they struggle to faithfully recreate a specific real place, especially under changed conditions such as different weather or rearranged objects. They also have trouble staying consistent over time: after a while they 'forget' where they are, which matters for applications like letting a robot practice driving in a virtual version of a city.
What's the solution?
CityRAG addresses this by using a large corpus of real-world map and image data as reference context. It learns to separate the permanent parts of a scene (buildings, roads) from transient ones (weather, moving cars). This lets it generate videos that are grounded in reality, handle different conditions, and maintain a consistent sense of space over long durations, including loop closure, where a trajectory returns to its starting point and the video seamlessly matches what was seen before.
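The paper does not spell out its retrieval mechanics here, but the core idea of grounding generation in geo-registered references can be illustrated with a toy sketch. Everything below (the `GeoRef` record, the `retrieve_context` function, the equirectangular distance) is a hypothetical stand-in, not the paper's actual pipeline: given a query camera pose, fetch the nearest geo-registered reference images, which a video model could then condition on.

```python
import math
from dataclasses import dataclass

@dataclass
class GeoRef:
    """Hypothetical geo-registered reference (e.g. a street-level image)."""
    lat: float
    lon: float
    image_id: str

def retrieve_context(refs, query_lat, query_lon, k=3):
    """Return the k references closest to the query pose.

    Uses a simple equirectangular approximation, which is adequate
    at city scale; a real system would index millions of references
    with a spatial data structure rather than a linear scan.
    """
    def dist(r):
        dlat = r.lat - query_lat
        dlon = (r.lon - query_lon) * math.cos(math.radians(query_lat))
        return math.hypot(dlat, dlon)
    return sorted(refs, key=dist)[:k]

# Toy usage: three references, query near the first two.
refs = [
    GeoRef(40.0000, -74.0, "corner_a"),
    GeoRef(40.0010, -74.0, "corner_b"),
    GeoRef(41.0000, -74.0, "far_away"),
]
context = retrieve_context(refs, 40.0005, -74.0, k=2)
print([r.image_id for r in context])
```

The retrieved references would serve as the grounding context; the generative model's learned priors would still supply the transient attributes (weather, lighting, moving objects) that the references do not fix.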
Why it matters?
This research is important because it opens the door to better simulations for training self-driving cars and robots. Instead of relying on simplified or unrealistic virtual environments, these systems can practice in a highly accurate digital copy of the real world, making them safer and more reliable.
Abstract
We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.