Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

2025-08-29

Summary

This paper introduces a new method, called TriMM, for creating 3D models using artificial intelligence. It focuses on using different types of data – like regular images, images with depth information, and point clouds – together to build more detailed and realistic 3D objects.

What's the problem?

Currently, most AI systems that generate 3D models only focus on one type of data at a time. For example, some use only images, while others use only point clouds. This limits the quality of the models because each type of data has its own strengths. Images provide good textures, but point clouds are better at defining the shape. Also, many systems need huge amounts of training data, which isn't always available.

What's the solution?

TriMM solves this by combining information from multiple data types – images, images with depth, and point clouds – in a coordinated way. It first learns to encode the distinctive features of each data type without losing what makes them special. Then, it adds auxiliary 2D and 3D supervision during training to keep the shared representation robust and accurate. Finally, it feeds the resulting multi-modal code into a 'triplane latent diffusion model' that actually generates the 3D assets, producing better textures and shapes.
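The pipeline described above can be sketched at a toy scale: project each modality's features into one shared code, then decode that code into a triplane (three axis-aligned feature planes) for a diffusion model to work on. This is a minimal illustration, not the paper's implementation; all dimensions, the tanh projections, and the mean-based fusion are invented stand-ins for TriMM's learned encoders and fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
D_RGB, D_RGBD, D_PC = 32, 40, 24    # per-modality feature sizes
D_CODE = 16                         # shared multi-modal code size
TRIPLANE_RES, TRIPLANE_CH = 8, 4    # toy triplane resolution / channels

def encode(feat, w):
    """Project a modality-specific feature into the shared code space."""
    return np.tanh(feat @ w)

# Modality-specific projection weights (stand-ins for learned encoders).
w_rgb  = rng.normal(size=(D_RGB,  D_CODE))
w_rgbd = rng.normal(size=(D_RGBD, D_CODE))
w_pc   = rng.normal(size=(D_PC,   D_CODE))

# Raw modality features for one object.
f_rgb  = rng.normal(size=D_RGB)
f_rgbd = rng.normal(size=D_RGBD)
f_pc   = rng.normal(size=D_PC)

# Collaborative coding: fuse the projected features into one shared code.
# (TriMM's fusion is learned; a plain mean is used here for illustration.)
codes = [encode(f_rgb, w_rgb), encode(f_rgbd, w_rgbd), encode(f_pc, w_pc)]
shared_code = np.mean(codes, axis=0)

# Decode the shared code into a triplane: three axis-aligned feature planes
# (XY, XZ, YZ) that a latent diffusion model would then operate on.
w_dec = rng.normal(size=(D_CODE, 3 * TRIPLANE_CH * TRIPLANE_RES * TRIPLANE_RES))
triplane = (shared_code @ w_dec).reshape(3, TRIPLANE_CH, TRIPLANE_RES, TRIPLANE_RES)

print(shared_code.shape, triplane.shape)  # (16,) (3, 4, 8, 8)
```

The key structural point this captures is that every modality lands in the same code space, so the downstream triplane decoder (and diffusion model) never needs to know which modalities were available at training time.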

Why it matters?

This research is important because it shows that you can create high-quality 3D models even with a relatively small amount of training data, by effectively combining different types of information. This opens up possibilities for creating 3D content more easily and efficiently, and it suggests that other types of data could be added to further improve the results.

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms (thus overlooking the complementary benefits of multi-modality data) or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.