Semantic Score Distillation Sampling for Compositional Text-to-3D Generation

Ling Yang, Zixiang Zhang, Junlin Han, Bohan Zeng, Runjia Li, Philip Torr, Wentao Zhang

2024-10-14

Summary

This paper introduces Semantic Score Distillation Sampling (SemanticSDS), a new method for generating high-quality 3D models from text descriptions, improving the way complex scenes with multiple objects are created.

What's the problem?

Creating detailed 3D models from text is challenging because very little 3D training data exists. Existing methods often struggle to generate complex scenes with many objects or interactions, and those that add box or layout guidance are typically too coarse to control fine details.

What's the solution?

SemanticSDS improves the process by using new semantic embeddings that help maintain consistency across different views of the 3D model. It transforms these embeddings into a semantic map that guides the generation process, allowing for more precise control over how objects are placed and interact in the scene. This method enhances the expressiveness and accuracy of the generated 3D content, making it easier to create complex scenes.
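The region-specific guidance described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the function name `region_specific_sds_grad`, the per-object gradient arrays, and the integer-labeled semantic map are all simplifying assumptions standing in for the real rendered embeddings and diffusion outputs.

```python
import numpy as np

def region_specific_sds_grad(per_object_grads, semantic_map):
    """Compose one SDS gradient for a rendered view by selecting,
    at each pixel, the gradient from the object prompt that the
    semantic map assigns to that pixel (a simplified sketch)."""
    grad = np.zeros_like(semantic_map, dtype=float)
    for obj_id, obj_grad in per_object_grads.items():
        mask = (semantic_map == obj_id)  # pixels owned by this object
        grad[mask] = obj_grad[mask]
    return grad

# Toy example: a 4x4 view whose left half belongs to object 0
# and right half to object 1, each with its own guidance signal.
semantic_map = np.array([[0, 0, 1, 1]] * 4)
grads = {0: np.full((4, 4), 0.5), 1: np.full((4, 4), -0.5)}
g = region_specific_sds_grad(grads, semantic_map)
```

The key idea this sketch captures is that each object's prompt only optimizes the pixels its region owns, which is what gives the method fine-grained compositional control.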

Why it matters?

This research is important because it advances the technology behind text-to-3D generation, making it possible to create more realistic and detailed 3D models from simple text prompts. This could have significant applications in fields like gaming, virtual reality, and product design, where high-quality 3D visuals are essential.

Abstract

Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research. Due to the scarcity of 3D data, state-of-the-art approaches utilize pre-trained 2D diffusion priors, optimized through Score Distillation Sampling (SDS). Despite progress, crafting complex 3D scenes featuring multiple objects or intricate interactions is still difficult. To tackle this, recent methods have incorporated box or layout guidance. However, these layout-guided compositional methods often struggle to provide fine-grained control, as they are generally coarse and lack expressiveness. To overcome these challenges, we introduce a novel SDS approach, Semantic Score Distillation Sampling (SemanticSDS), designed to effectively improve the expressiveness and accuracy of compositional text-to-3D generation. Our approach integrates new semantic embeddings that maintain consistency across different rendering views and clearly differentiate between various objects and parts. These embeddings are transformed into a semantic map, which directs a region-specific SDS process, enabling precise optimization and compositional generation. By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models, thereby achieving superior quality in 3D content generation, particularly for complex objects and scenes. Experimental results demonstrate that our SemanticSDS framework is highly effective for generating state-of-the-art complex 3D content. Code: https://github.com/YangLing0818/SemanticSDS-3D
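For context, the standard SDS update that SemanticSDS builds on can be written as a one-liner: the gradient with respect to the rendered image is a weighted difference between the diffusion model's noise prediction and the injected noise. The snippet below is a minimal sketch of that formula; `eps_pred` is a stand-in array, not a call to a real diffusion network.

```python
import numpy as np

def sds_gradient(eps_pred, eps, w_t):
    """Score Distillation Sampling gradient w.r.t. the rendered image:
    w(t) * (predicted noise - injected noise). In a full pipeline this
    is backpropagated through the renderer to update 3D parameters."""
    return w_t * (eps_pred - eps)

rng = np.random.default_rng(0)
eps = rng.standard_normal((8, 8))       # noise added to the rendering
eps_pred = eps + 0.1                    # pretend prediction nudged toward the prompt
g = sds_gradient(eps_pred, eps, w_t=1.0)
```

SemanticSDS modifies this step by routing the gradient per region using the semantic map, so different objects in the scene are optimized against their own prompts.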