MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Sankalp Sinha, Mohammad Sadil Khan, Muhammad Usama, Shino Sam, Didier Stricker, Sk Aziz Ali, Muhammad Zeshan Afzal

2024-11-27

Summary

This paper introduces MARVEL-40M+, a large-scale dataset of text annotations for 3D assets designed to improve high-quality text-to-3D content creation, together with MARVEL-FX3D, a fast companion pipeline for generating textured 3D models from text.

What's the problem?

Creating realistic 3D models from text prompts is difficult because existing datasets are often too small, lack diversity, and provide only shallow annotations. This makes it hard for generative models to understand what to create, leading to lower-quality outputs.

What's the solution?

MARVEL-40M+ addresses these issues by providing 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. The authors developed a multi-stage annotation pipeline that combines open-source multi-view vision-language models (VLMs) and large language models (LLMs) to produce multi-level descriptions for each asset, ranging from detailed captions (150-200 words) down to concise semantic tags (10-20 words), and it folds in human metadata from the source datasets to add domain-specific detail and curb hallucinations. They also built a two-stage text-to-3D pipeline called MARVEL-FX3D, which fine-tunes Stable Diffusion on these annotations and then uses a pretrained image-to-3D network to generate textured 3D meshes in about 15 seconds.
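
The paper's pipeline isn't reproduced as code here, so the following is a minimal Python sketch of what one pass of such a multi-level annotation pipeline could look like. The record fields and the `vlm`/`llm` interfaces are illustrative assumptions, not the authors' implementation or the released dataset schema.

```python
from dataclasses import dataclass, field

@dataclass
class MarvelAnnotation:
    """Illustrative record for one asset's multi-level annotations.
    Field names are assumptions, not the released schema."""
    asset_id: str
    source_dataset: str                     # one of the seven aggregated sources
    human_metadata: dict = field(default_factory=dict)
    detailed_description: str = ""          # ~150-200 words, for fine-grained reconstruction
    semantic_tags: list[str] = field(default_factory=list)  # ~10-20 words, for rapid prototyping

def annotate(asset_views, metadata, vlm, llm) -> MarvelAnnotation:
    """Sketch of the multi-stage flow: a multi-view VLM captions the asset's
    renders, an LLM grounds the caption in human metadata (reducing
    hallucinations), then condenses it into progressively shorter levels."""
    caption = vlm.describe(asset_views)                  # stage 1: multi-view captioning
    detailed = llm.rewrite(caption, context=metadata)    # stage 2: fuse with source metadata
    tags = llm.summarize(detailed, max_words=20)         # stage 3: condense to semantic tags
    return MarvelAnnotation(
        asset_id=metadata["id"],
        source_dataset=metadata["source"],
        human_metadata=metadata,
        detailed_description=detailed,
        semantic_tags=tags,
    )
```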

Why it matters?

This research is important because it significantly enhances the ability of AI systems to generate high-fidelity 3D content from text. By providing a comprehensive dataset and innovative methods, MARVEL-40M+ can help advance fields like gaming, virtual reality, and film production, making it easier for creators to develop realistic and engaging 3D environments.

Abstract

Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from source datasets into our annotation pipeline to add domain-specific information to our annotations and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% by GPT-4 and 73.40% by human evaluators.
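
Based only on the abstract's description of MARVEL-FX3D, here is a minimal sketch of such a two-stage text-to-3D pipeline. The Stable Diffusion call uses the real Hugging Face diffusers API, but the checkpoint path and the `image_to_3d` function are placeholders, since the abstract does not name the released artifacts.

```python
import torch
from diffusers import StableDiffusionPipeline

# Stage 1: text -> image with a Stable Diffusion checkpoint fine-tuned on
# MARVEL-40M+ annotations (the model path below is hypothetical).
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/marvel-finetuned-sd",
    torch_dtype=torch.float16,
).to("cuda")
image = pipe("a weathered bronze lion statue, detailed surface texture").images[0]

# Stage 2: image -> textured mesh via a pretrained image-to-3D network.
# `image_to_3d` is a stand-in for whichever network the pipeline uses;
# the paper reports producing a textured mesh within 15 s end to end.
mesh = image_to_3d(image)  # placeholder, not a real library call
mesh.export("asset.obj")
```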