FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li

2025-03-19


Summary

This paper introduces FlexWorld, a framework that builds a 3D scene from a single image and lets you explore it from flexible viewpoints, including 360° rotation and zooming.

What's the problem?

A single image shows a scene from only one viewpoint, so everything behind objects or outside the frame is missing and has to be generated. Building a full 3D scene you can move around in is harder still because 3D training data is scarce.

What's the solution?

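FlexWorld starts from a coarse 3D scene built from the single image and renders it from new camera positions, which produces incomplete views. A video-to-video (V2V) diffusion model, built on a pre-trained video model, turns these incomplete renders into high-quality novel views. The new content is then merged into the global scene through geometry-aware fusion, and the process repeats so the scene expands progressively.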

Why does it matter?

This is important because it allows you to create immersive 3D experiences from simple images, which could be used in virtual reality, gaming, or other applications where you want to explore a scene from different perspectives.

Abstract

Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.
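
The progressive expansion idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scene is reduced to points on a unit circle, `generate_view_content` is a hypothetical stand-in for the V2V model plus depth estimation, and `fuse` uses a simple nearest-neighbor distance threshold as an assumed placeholder for geometry-aware scene fusion. All function names and parameters here are invented for illustration.

```python
import numpy as np

def points_on_circle(angles_deg):
    """Toy 3D content: points on the unit circle in the x-z plane."""
    t = np.deg2rad(np.asarray(angles_deg, dtype=float))
    return np.stack([np.cos(t), np.zeros_like(t), np.sin(t)], axis=1)

def generate_view_content(yaw_deg, half_fov_deg=30, step_deg=10):
    """Hypothetical stand-in for the V2V model plus depth estimation:
    returns the 3D points seen by a camera looking along yaw_deg."""
    angles = np.arange(yaw_deg - half_fov_deg, yaw_deg + half_fov_deg + 1, step_deg)
    return points_on_circle(angles)

def fuse(scene, new_points, min_dist=0.05):
    """Toy fusion rule (an assumption, not the paper's method):
    keep only new points that the scene does not already cover."""
    if len(scene) == 0:
        return new_points
    # Distance from each new point to every existing scene point.
    dists = np.linalg.norm(new_points[:, None, :] - scene[None, :, :], axis=2)
    is_new = dists.min(axis=1) > min_dist
    return np.concatenate([scene, new_points[is_new]], axis=0)

def expand_scene(camera_yaws_deg):
    """Progressively grow the scene, one camera pose at a time."""
    scene = np.empty((0, 3))
    for yaw in camera_yaws_deg:
        scene = fuse(scene, generate_view_content(yaw))
    return scene

# Six camera poses, 60 degrees apart, sweep a full 360-degree rotation.
scene = expand_scene(range(0, 360, 60))
print(scene.shape)
```

In this toy setup, neighboring views overlap at their edges, and the fusion step discards the overlapping points so the global scene ends up with one point per 10° of the circle. A real system would replace both stand-in functions with learned models and a fusion rule that accounts for depth consistency.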
