SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

Hyojun Go, Byeongjun Park, Jiho Jang, Jin-Young Kim, Soonwoo Kwon, Changick Kim

2024-11-26

Summary

This paper introduces SplatFlow, a unified framework that lets users both generate and edit 3D Gaussian Splatting scenes directly from text descriptions, making it simpler to create and modify complex 3D content.

What's the problem?

Creating and editing 3D scenes can be complicated and often requires specialized tools that focus on specific tasks. Existing methods typically do not provide a unified approach for both generating new scenes and editing them, which can make the content creation process less efficient and more difficult for users.

What's the solution?

SplatFlow addresses this issue by combining two main components: a multi-view rectified flow model and a Gaussian Splatting Decoder. Conditioned on a text prompt, the multi-view model generates multiple images, depth maps, and camera poses all at once, which helps it handle the varying scene sizes and complex camera movements found in real-world scenes. The Gaussian Splatting Decoder then converts these generated outputs into detailed 3D representations in a single feed-forward pass. The same framework also supports seamless editing of 3D scenes without any additional complex pipelines.
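
To make this two-stage design more concrete, here is a minimal, self-contained sketch of how such a pipeline could fit together. The class names, dimensions, and toy networks below are illustrative assumptions, not the paper's actual architecture or API; only the overall flow (rectified-flow sampling in a joint latent space, then feed-forward decoding into Gaussian parameters) follows the paper's description.

```python
import torch
import torch.nn as nn

# All dimensions and networks here are toy assumptions for illustration.
LATENT_DIM = 256   # assumed size of one view's joint latent (image + depth + pose)
TEXT_DIM = 128     # assumed text-embedding size

class MultiViewRFModel(nn.Module):
    """Toy stand-in for the multi-view rectified flow model: predicts the
    flow velocity v(z_t, t | text) for the joint latent of each view."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + TEXT_DIM + 1, 512),
            nn.SiLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, z, t, text_embed):
        # Condition the velocity prediction on the timestep and text prompt.
        return self.net(torch.cat([z, text_embed, t.unsqueeze(-1)], dim=-1))

class GSDecoder(nn.Module):
    """Toy stand-in for the feed-forward Gaussian Splatting decoder:
    maps denoised latents to per-view 3D Gaussian parameters."""
    def __init__(self, gaussians_per_view=1024, params_per_gaussian=14):
        super().__init__()
        # 14 = 3 (mean) + 4 (rotation) + 3 (scale) + 1 (opacity) + 3 (color)
        self.head = nn.Linear(LATENT_DIM, gaussians_per_view * params_per_gaussian)

    def forward(self, z):
        return self.head(z)

@torch.no_grad()
def generate_scene(rf_model, gs_decoder, text_embed, num_views=8, steps=50):
    # Start from Gaussian noise in the joint latent space of all views.
    z = torch.randn(num_views, LATENT_DIM)
    text = text_embed.expand(num_views, -1)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((num_views,), i * dt)
        # Rectified-flow sampling: Euler integration of dz/dt = v(z, t).
        z = z + rf_model(z, t, text) * dt
    # Feed-forward decoding of the denoised latents into 3DGS parameters.
    return gs_decoder(z)

scene = generate_scene(MultiViewRFModel(), GSDecoder(), torch.randn(1, TEXT_DIM))
print(scene.shape)  # (num_views, gaussians_per_view * params_per_gaussian)
```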

Why it matters?

This research is important because it streamlines the process of creating and editing 3D content, making it more accessible to creators in fields like gaming, animation, and virtual reality. By allowing users to generate detailed 3D scenes from simple text descriptions, SplatFlow can significantly enhance creativity and productivity in digital content creation.

Abstract

Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks, including object editing, novel view synthesis, and camera pose estimation, within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
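
The "training-free inversion and inpainting" used for editing can also be sketched generically. The code below assumes the same toy rf_model interface as the sketch above (forward(z, t, text_embed) returning a velocity, with text_embed matching z's batch dimension); the inversion simply integrates the flow ODE backward, and the inpainting loop is a RePaint-style scheme adapted to rectified flow's straight-line paths. This is a plausible reading of the abstract, not the authors' exact procedure.

```python
import torch

@torch.no_grad()
def rf_invert(rf_model, z_data, text_embed, steps=50):
    """Training-free inversion: integrate the rectified-flow ODE backward,
    mapping clean latents (t = 1) back toward noise (t = 0)."""
    z = z_data.clone()
    dt = 1.0 / steps
    for i in reversed(range(steps)):
        t = torch.full((z.shape[0],), (i + 1) * dt)
        z = z - rf_model(z, t, text_embed) * dt  # reverse Euler step
    return z

@torch.no_grad()
def rf_inpaint(rf_model, z_known, mask, text_embed, steps=50):
    """Training-free inpainting: regenerate only the masked region (mask = 1)
    while pinning the known region (mask = 0) to its straight noise-to-data
    path at every step, in the spirit of RePaint-style diffusion inpainting."""
    noise = torch.randn_like(z_known)
    z = noise
    dt = 1.0 / steps
    for i in range(steps):
        t_val = i * dt
        t = torch.full((z.shape[0],), t_val)
        # Rectified flow uses straight paths z_t = (1 - t) * noise + t * data,
        # so the known region at time t is a simple interpolation.
        z_known_t = (1.0 - t_val) * noise + t_val * z_known
        z = mask * z + (1.0 - mask) * z_known_t
        z = z + rf_model(z, t, text_embed) * dt  # forward Euler step
    return mask * z + (1.0 - mask) * z_known
```

The abstract indicates that these latent-space operations underpin the unified editing tasks (object editing, novel view synthesis, and camera pose estimation); the mask-and-resample pattern above illustrates one way such tasks can be cast as inpainting, for example by masking the latents of missing views or of an object region to change.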