Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, Gordon Wetzstein
2024-07-19

Summary
This paper introduces a method for creating long video sequences of city streets using autoregressive video diffusion. It lets users generate realistic street views conditioned on inputs such as a city name, weather conditions, and an underlying street map that specifies the camera's route.
What's the problem?
Generating long, consistent videos of street scenes is hard: existing video generation and 3D view synthesis methods can maintain visual quality and coherence only over short sequences, and they break down when the camera has to travel across large areas, such as several city blocks.
What's the solution?
The authors combine a video diffusion model with an autoregressive framework that synthesizes the video chunk by chunk, so generation scales to arbitrarily long camera trajectories. Maps and text prompts guide the process, giving detailed control over the scenes produced. The key ingredient is a new temporal imputation method that keeps each generated chunk consistent with the previous one, preventing the output from drifting away from realistic city imagery even over long distances; a sketch of this pattern follows below. The system is trained on posed imagery from Google Street View paired with contextual map data.
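To make the autoregressive mechanism concrete, here is a minimal PyTorch sketch of chunked sampling with RePaint-style imputation applied along the time axis. All names (sample_chunk, generate_long_video, the toy denoiser) are hypothetical, and the paper's actual temporal imputation and diffusion sampler differ in their details; this only illustrates the general pattern of re-imputing the overlap frames at every denoising step.

```python
import torch

def sample_chunk(model, cond, overlap_frames, chunk_len, alphas_cumprod, shape):
    """Reverse diffusion over one chunk of frames.

    overlap_frames: (K, C, H, W) frames already committed by the previous
    chunk, or None for the very first chunk.
    """
    x = torch.randn(chunk_len, *shape)  # start every chunk from pure noise
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        if overlap_frames is not None:
            K = overlap_frames.shape[0]
            # Temporal imputation: at every step, overwrite the overlapping
            # frames with a freshly noised copy of the known frames at the
            # current noise level, so the rest of the chunk is denoised in a
            # way that stays consistent with them.
            noise = torch.randn_like(overlap_frames)
            x[:K] = a_t.sqrt() * overlap_frames + (1 - a_t).sqrt() * noise
        # One deterministic DDIM-style update from the noise prediction.
        eps = model(x, t, cond)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x

def generate_long_video(model, conds, chunk_len=16, overlap=4, steps=50,
                        shape=(3, 64, 64)):
    """Autoregressively chain chunks, imputing `overlap` frames each time."""
    alphas_cumprod = torch.linspace(0.999, 0.01, steps)  # toy noise schedule
    video = sample_chunk(model, conds[0], None, chunk_len, alphas_cumprod, shape)
    for cond in conds[1:]:
        chunk = sample_chunk(model, cond, video[-overlap:], chunk_len,
                             alphas_cumprod, shape)
        video = torch.cat([video, chunk[overlap:]], dim=0)  # keep new frames
    return video

if __name__ == "__main__":
    # Toy stand-in for the conditional video denoiser (predicts noise eps).
    model = lambda x, t, cond: torch.zeros_like(x)
    conds = ["sunny street in Paris"] * 3  # placeholder per-chunk conditioning
    video = generate_long_video(model, conds)
    print(video.shape)  # torch.Size([40, 3, 64, 64]) -> 16 + 2 * 12 frames
```

The design point worth noting is that the overlap frames are re-imposed at every denoising step rather than only at initialization, so the new chunk is denoised jointly with, not merely seeded from, the already-committed frames; this is what counteracts the gradual drift that plagues naive frame-by-frame autoregression.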
Why it matters?
This research matters because it enables realistic, controllable digital renditions of urban environments. The ability to generate street views along arbitrary routes has applications in urban planning, virtual reality, and gaming, making it easier to create immersive experiences that reflect real-world locations.
Abstract
We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data: posed imagery from Google Street View, along with contextual map data, which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at https://boyangdeng.com/streetscapes.
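As a loose illustration of the conditioning interface the abstract describes (language input plus an underlying map/layout hosting the desired trajectory), the sketch below assembles per-chunk conditioning by cropping a rasterized top-down layout around each camera pose. Every name here (StreetscapeCondition, conditions_along_trajectory, the (row, col) pose format) is invented for illustration; it is not the paper's API.

```python
from dataclasses import dataclass
from typing import List, Tuple
import torch

@dataclass
class StreetscapeCondition:
    """Bundles the two conditioning signals the abstract names."""
    text: str             # language input, e.g. "Paris, rainy evening"
    layout: torch.Tensor  # (C, h, w) map crop local to the current pose

def conditions_along_trajectory(prompt: str, map_raster: torch.Tensor,
                                poses: List[Tuple[int, int]], crop: int = 64):
    """Crop the top-down layout around each camera pose (given here as
    (row, col) map pixels) to build one conditioning bundle per chunk."""
    conds = []
    for r, c in poses:
        patch = map_raster[:, r:r + crop, c:c + crop]
        conds.append(StreetscapeCondition(prompt, patch))
    return conds

# Example: a 2-channel layout (e.g. roads + building footprints) and a
# straight trajectory sampled every 30 map pixels.
map_raster = torch.zeros(2, 256, 256)
poses = [(100, 10), (100, 40), (100, 70)]
conds = conditions_along_trajectory("London, light rain", map_raster, poses)
print(len(conds), conds[0].layout.shape)  # 3 torch.Size([2, 64, 64])
```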