
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Linghao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xiaoshuai Zhang, Isabella Liu, Hongzhi Wu, Hao Su

2024-08-20

Summary

This paper introduces MeshFormer, a model that generates high-quality 3D meshes from a few input images by explicitly building 3D structure into its representation, inputs, and training supervision.

What's the problem?

Reconstructing accurate 3D models from images is difficult and computationally expensive. Existing open-world reconstruction methods lack sufficient 3D inductive bias: they do not effectively exploit the three-dimensional structure of the scene, so training is costly and the extracted meshes are often low quality.

What's the solution?

MeshFormer addresses these challenges by building explicit 3D structure into the model. Instead of the commonly used triplane representation, it stores features in sparse 3D voxels and combines transformers with 3D convolutions, giving the network an explicit 3D structure and a projective bias. In addition to sparse-view RGB images, the network takes normal maps as input (which can be predicted by 2D diffusion models) and also learns to generate corresponding normal maps, which guide and refine the geometry. Finally, it combines Signed Distance Function (SDF) supervision with surface rendering, so high-quality meshes are learned directly without a complex multi-stage training process.
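
To make the architecture idea concrete, here is a minimal PyTorch-style sketch (not the authors' code) of how 3D convolutions and transformer attention could be interleaved over a voxel feature grid. It uses a small dense grid for simplicity, whereas MeshFormer works with sparse voxels; the class name, channel counts, and layer choices are illustrative assumptions.

```python
# Illustrative sketch only -- not MeshFormer's actual implementation.
import torch
import torch.nn as nn

class VoxelConvAttentionBlock(nn.Module):
    """Hypothetical block: a local 3D convolution followed by
    global self-attention over the flattened voxel tokens."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
        )
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (B, C, D, H, W) dense voxel feature grid
        x = voxels + self.conv(voxels)         # local geometry via 3D convolution
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, D*H*W, C) voxel tokens
        tokens = self.attn(tokens)             # global context via self-attention
        return tokens.transpose(1, 2).reshape(b, c, d, h, w)

# Tiny usage example on a coarse 16^3 feature grid.
block = VoxelConvAttentionBlock(channels=64)
features = torch.randn(1, 64, 16, 16, 16)
print(block(features).shape)  # torch.Size([1, 64, 16, 16, 16])
```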

Why it matters?

This research matters because it improves both the quality and the efficiency of generating realistic, textured 3D models from images, with applications in fields such as gaming, virtual reality, and robotics. It can also be paired with 2D diffusion models for fast single-image-to-3D and text-to-3D generation, opening up new possibilities for building detailed digital environments and objects.

Abstract

Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. Specifically, instead of using a triplane representation, we store features in 3D sparse voxels and combine transformers with 3D convolutions to leverage an explicit 3D structure and projective bias. In addition to sparse-view RGB input, we require the network to take input and generate corresponding normal maps. The input normal maps can be predicted by 2D diffusion models, significantly aiding in the guidance and refinement of the geometry's learning. Moreover, by combining Signed Distance Function (SDF) supervision with surface rendering, we directly learn to generate high-quality meshes without the need for complex multi-stage training processes. By incorporating these explicit 3D biases, MeshFormer can be trained efficiently and deliver high-quality textured meshes with fine-grained geometric details. It can also be integrated with 2D diffusion models to enable fast single-image-to-3D and text-to-3D tasks. Project page: https://meshformer3d.github.io
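
As a rough illustration of the training signal described in the abstract, the sketch below combines direct SDF supervision with a surface-rendering term on normals. It is a simplified example: the function name, tensor shapes, loss weights, and the choice of L1 and cosine terms are assumptions, not the paper's actual loss.

```python
# Illustrative sketch only -- not the authors' training code.
import torch
import torch.nn.functional as F

def combined_loss(pred_sdf, gt_sdf, rendered_normals, gt_normals,
                  w_sdf: float = 1.0, w_normal: float = 1.0):
    """Hypothetical combined loss; names, shapes, and weights are assumptions.

    pred_sdf, gt_sdf:             (N,) SDF values at sampled 3D points
    rendered_normals, gt_normals: (M, 3) per-pixel normals from surface rendering
                                  and from the target / predicted normal maps
    """
    sdf_loss = F.l1_loss(pred_sdf, gt_sdf)                # direct SDF supervision
    normal_loss = (1.0 - F.cosine_similarity(
        rendered_normals, gt_normals, dim=-1)).mean()     # surface-rendering term
    return w_sdf * sdf_loss + w_normal * normal_loss

# Tiny usage example with random tensors.
loss = combined_loss(
    torch.randn(1024), torch.randn(1024),
    F.normalize(torch.randn(256, 3), dim=-1),
    F.normalize(torch.randn(256, 3), dim=-1),
)
print(loss.item())
```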