
Structured 3D Latents for Scalable and Versatile 3D Generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang

2024-12-06

Summary

This paper introduces a new 3D generation method built around a unified representation called Structured LATent (SLAT), which enables flexible and efficient creation of high-quality 3D assets from text or image inputs.

What's the problem?

Creating detailed and versatile 3D models is challenging because existing methods typically lock the output into a single format and often fail to capture both the geometry (shape) and appearance (texture) of objects effectively. This limits the quality and usability of the generated 3D assets across applications.

What's the solution?

The researchers developed SLAT, a unified representation that combines a sparse 3D grid with dense multiview visual features extracted from a vision foundation model, so that a single latent captures both geometry and appearance. Decoders attached to this latent can produce different output types, such as Radiance Fields, 3D Gaussians, and meshes, while maintaining high quality. For generation, they used rectified flow transformers tailored to SLAT, training models with up to 2 billion parameters on a dataset of 500K diverse 3D objects, which lets the system produce high-quality results from text or image prompts. SLAT also supports local editing, allowing users to change a specific region of a 3D model without affecting the rest of the structure.
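To make the idea concrete, here is a minimal Python sketch of what a sparse structured latent could look like: a set of active voxel coordinates, each paired with a feature vector, feeding interchangeable decoder heads. All names here (StructuredLatent, decode, the head labels) are illustrative assumptions for this summary, not the paper's actual code or released API.

```python
import numpy as np

class StructuredLatent:
    """Toy SLAT-like container: a sparse set of active voxels in an
    N^3 grid, each carrying a local latent feature vector."""

    def __init__(self, resolution, coords, feats):
        # coords: (M, 3) integer indices of the active voxels
        # feats:  (M, C) latent feature vector attached to each voxel
        assert coords.shape[0] == feats.shape[0]
        self.resolution = resolution
        self.coords = coords
        self.feats = feats


def decode(latent, head):
    """One latent, interchangeable decoder heads: the property that
    lets a single representation yield multiple output formats."""
    if head == "gaussians":
        # A Gaussian head would map each voxel feature to splat
        # parameters (position offset, scale, color, opacity).
        return {"centers": latent.coords / latent.resolution,
                "params": latent.feats}
    if head == "mesh":
        # A mesh head would predict per-voxel occupancy or SDF values,
        # from which a surface is then extracted.
        return {"occupied": latent.feats.mean(axis=1) > 0}
    raise ValueError(f"unknown decoder head: {head}")


# Example: 100 active voxels in a 64^3 grid, 8-dim features each.
rng = np.random.default_rng(0)
slat = StructuredLatent(
    resolution=64,
    coords=rng.integers(0, 64, size=(100, 3)),
    feats=rng.standard_normal((100, 8)),
)
gaussians = decode(slat, "gaussians")
```

Storing only the active voxels keeps memory proportional to the occupied surface of the object rather than the full N^3 volume, which is what makes this kind of representation scalable to high resolutions.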

Why it matters?

This research matters because it makes creating and manipulating 3D assets faster and more flexible for designers and developers in fields like gaming, virtual reality, and animation. By providing a single tool that generates high-quality models from both text and images, and that can decode to whichever format a pipeline needs, SLAT can significantly streamline digital content creation workflows.

Abstract

We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
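For readers unfamiliar with the rectified flow transformers mentioned above, the sketch below shows the generation-time procedure in its simplest form: a learned velocity field is Euler-integrated from Gaussian noise (t = 0) to a sample (t = 1). In the paper, the state being integrated would be the SLAT itself; the toy velocity function here is a closed-form stand-in for the trained transformer, included only to illustrate the sampling loop.

```python
import numpy as np

def sample_rectified_flow(velocity_fn, shape, steps=50, seed=0):
    """Draw a sample by Euler-integrating the learned velocity field
    dx/dt = v(x, t) from Gaussian noise at t=0 toward data at t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Toy stand-in for the trained transformer (an assumption, not the
# paper's model): the conditional rectified-flow velocity toward one
# fixed target latent, (x1 - x_t) / (1 - t).
target = np.ones((100, 8))
toy_velocity = lambda x, t: (target - x) / max(1.0 - t, 1e-3)

latent = sample_rectified_flow(toy_velocity, shape=(100, 8))
```

Because rectified flow trains the velocity field to follow near-straight paths between noise and data, sampling needs only a modest number of integration steps, which is part of what makes these models practical at the 2-billion-parameter scale reported in the abstract.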