SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan
2024-11-27

Summary
This paper introduces SAR3D, a framework that uses multi-scale autoregressive modeling to quickly generate and understand 3D objects conditioned on text or images.
What's the problem?
Creating and understanding 3D objects using AI has been difficult because existing methods are often slow and inefficient. Many approaches do not effectively capture the complexity of 3D shapes, making it hard for models to generate high-quality results in a reasonable amount of time.
What's the solution?
SAR3D addresses these challenges with a multi-scale 3D vector-quantized variational autoencoder (VQVAE) that encodes a 3D object as a hierarchy of discrete token maps, from coarse to fine. Instead of predicting one token at a time, SAR3D predicts the entire next scale of this multi-scale representation in a single step, which speeds up generation significantly. Because these tokens carry hierarchical 3D-aware information, the framework can also fine-tune large language models (LLMs) on them so the models can understand and describe 3D objects.
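To make the speedup concrete, here is a minimal toy sketch of next-scale prediction (all names, scale sizes, and the stand-in "model" are hypothetical, not the authors' code). The key point is that the number of autoregressive steps equals the number of scales rather than the total number of tokens:

```python
# Toy sketch of next-scale autoregressive generation (assumed setup).
# One step per scale: each step emits ALL tokens of the next, finer scale,
# conditioned on every coarser token generated so far.

import random

SCALES = [1, 2, 4, 8]      # token-map side lengths, coarse to fine (assumed)
CODEBOOK_SIZE = 16         # size of the VQ codebook (assumed)

def predict_scale(context_tokens, side):
    """Stand-in for the transformer: emit side*side codebook indices for
    the next scale, conditioned on the previously generated tokens."""
    random.seed(len(context_tokens))  # deterministic toy 'model'
    return [random.randrange(CODEBOOK_SIZE) for _ in range(side * side)]

def generate():
    tokens = []            # flattened multi-scale token sequence
    steps = 0
    for side in SCALES:    # one autoregressive step per scale
        tokens.extend(predict_scale(tokens, side))
        steps += 1
    return tokens, steps

seq, steps = generate()
# 4 scale-level steps replace 1 + 4 + 16 + 64 = 85 token-by-token steps
print(steps, len(seq))
```

With these toy scales, four steps produce the same 85-token sequence that plain next-token prediction would need 85 steps to emit, which is the source of the speedup the paper claims.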
Why it matters?
This research matters because it lets AI both create and interpret 3D content quickly and accurately. Faster, higher-quality 3D generation can benefit fields such as gaming, virtual reality, and design, making it easier for creators to visualize and manipulate complex structures.
Abstract
Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.