Structured 3D Latents for Scalable and Versatile 3D Generation
Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang
2024-12-06
Summary
This paper introduces a new method for creating high-quality 3D assets built around a unified representation called Structured LATent (SLAT), which enables flexible and efficient generation of 3D content from text or image inputs.
What's the problem?
Creating detailed and versatile 3D models is challenging because existing methods are often locked into a single output format (such as Radiance Fields, 3D Gaussians, or meshes) or fail to capture both the geometry and the appearance of objects effectively. This limits the quality and usability of generated 3D assets across applications.
What's the solution?
The researchers developed SLAT, a unified representation that pairs a sparsely populated 3D grid with dense visual features extracted from a powerful vision foundation model, capturing both an object's geometry and its appearance. Because the representation is not tied to any single output format, the same latent can be decoded into Radiance Fields, 3D Gaussians, or meshes while maintaining high quality. They trained rectified flow transformers on this representation using a large dataset of 500K diverse 3D objects, enabling the model to produce high-quality results from text or image prompts. Additionally, SLAT supports local editing, allowing users to modify specific regions of a 3D model without affecting the rest of the structure.
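To make the core idea concrete, here is a minimal sketch of a SLAT-like structure: a set of active voxels on a sparse 3D grid, each carrying a latent feature vector, with interchangeable decoder heads for different output formats. All class names, dimensions, and decoder designs below are hypothetical illustrations, not the authors' actual code or API.

```python
# Hypothetical sketch of a structured latent: sparse voxel coordinates plus
# per-voxel latent features, decodable by interchangeable output heads.
from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class StructuredLatent:
    coords: torch.Tensor   # (N, 3) integer voxel indices on a sparse grid
    feats: torch.Tensor    # (N, C) latent feature vector attached to each voxel

class GaussianHead(nn.Module):
    """Decodes each voxel's latent into parameters of local 3D Gaussians."""
    def __init__(self, latent_dim: int, params_per_gaussian: int = 14):
        super().__init__()
        # 14 = position offset (3) + scale (3) + rotation quaternion (4)
        #      + opacity (1) + color (3)
        self.proj = nn.Linear(latent_dim, params_per_gaussian)

    def forward(self, slat: StructuredLatent) -> torch.Tensor:
        return self.proj(slat.feats)

class SDFHead(nn.Module):
    """Decodes each voxel's latent into local signed-distance samples for meshing."""
    def __init__(self, latent_dim: int, samples_per_voxel: int = 8):
        super().__init__()
        self.proj = nn.Linear(latent_dim, samples_per_voxel)

    def forward(self, slat: StructuredLatent) -> torch.Tensor:
        return self.proj(slat.feats)

# The same latent feeds either head, which is what makes the
# representation's output format flexible.
slat = StructuredLatent(coords=torch.randint(0, 64, (1024, 3)),
                        feats=torch.randn(1024, 8))
gaussians = GaussianHead(latent_dim=8)(slat)   # (1024, 14)
sdf_samples = SDFHead(latent_dim=8)(slat)      # (1024, 8)
```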
Why it matters?
This research matters because it makes creating and manipulating 3D assets more efficient, simplifying the work of designers and developers in fields like gaming, virtual reality, and animation. By providing a flexible tool that generates high-quality models from both text and images, SLAT can significantly streamline digital content creation workflows.
Abstract
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
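The abstract names rectified flow transformers as the generative model. For readers unfamiliar with the objective, below is a minimal sketch of one rectified-flow training step on latent features: sample a straight-line interpolant between noise and data, and regress the constant velocity along that line. The straight-line interpolant and velocity target are the standard rectified-flow formulation; the tiny MLP is a hypothetical stand-in for the paper's transformer, and all shapes are illustrative.

```python
# Hypothetical sketch of a rectified-flow training step (standard formulation,
# not the paper's implementation). A small MLP stands in for the transformer.
import torch
import torch.nn as nn

latent_dim = 8
model = nn.Sequential(nn.Linear(latent_dim + 1, 64), nn.SiLU(),
                      nn.Linear(64, latent_dim))

def rectified_flow_loss(x1: torch.Tensor) -> torch.Tensor:
    """x1: a batch of clean latent features, shape (B, latent_dim)."""
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(x1.shape[0], 1)               # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                   # straight-line interpolant
    v_target = x1 - x0                           # constant velocity along the line
    v_pred = model(torch.cat([xt, t], dim=-1))   # predict velocity at (xt, t)
    return ((v_pred - v_target) ** 2).mean()

loss = rectified_flow_loss(torch.randn(32, latent_dim))
loss.backward()
```

At sampling time, the learned velocity field is integrated from noise to data along near-straight paths, which is what makes rectified flow efficient to sample from.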