Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds
Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, Xinggang Wang
2025-08-22
Summary
This paper introduces a new method for creating 3D models of people from just two pictures: one from the front and one from the back. It aims to make it easier for anyone to create their own 3D digital human.
What's the problem?
The biggest challenge is that only having a front and back view provides very limited information. It's hard to ensure the 3D model is accurate and consistent when there's so little overlap between the images, and a lot of the body's shape is hidden. Essentially, the system needs to 'fill in the gaps' and make sure everything connects realistically.
What's the solution?
The researchers adapted an existing foundation 3D reconstruction model so that it works well with these two extreme views. They trained it on a large body of human scan data so it can predict a consistent 3D point cloud even though the two images barely overlap. An enhancement step then estimates realistic colors for regions neither camera sees. Finally, the complete colored point cloud is converted into a representation called '3D Gaussians', which makes the final rendering look much better.
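The three-stage pipeline described above (geometry, then color, then Gaussians) can be sketched in code. This is a minimal illustration of the data flow only, not the authors' implementation: every function name and parameter below is a hypothetical placeholder, and the learned networks are replaced with dummy stand-ins.

```python
import numpy as np

def reconstruct_points(front_img, back_img, n_points=1000):
    # Hypothetical stand-in for the geometry model: the real system runs a
    # learned reconstruction network on the two views; here we just sample
    # a fixed-size point cloud to illustrate the interface.
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_points, 3))

def estimate_colors(points, front_img, back_img):
    # In the paper, colors come from the two views plus an enhancement step
    # that fills regions neither camera sees; here every point simply gets
    # a placeholder mid-gray color.
    return np.full((points.shape[0], 3), 0.5)

def points_to_gaussians(points, colors):
    # Each colored point becomes one 3D Gaussian: a center and color, plus
    # per-Gaussian scale, rotation (quaternion), and opacity parameters
    # that a Gaussian-splatting renderer consumes.
    n = points.shape[0]
    return {
        "means": points,
        "colors": colors,
        "scales": np.full((n, 3), 0.01),
        "rotations": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # identity quaternions
        "opacities": np.ones((n, 1)),
    }

# Two placeholder 1024x1024 RGB inputs (front and back view).
front = np.zeros((1024, 1024, 3))
back = np.zeros((1024, 1024, 3))
pts = reconstruct_points(front, back)
gaussians = points_to_gaussians(pts, estimate_colors(pts, front, back))
```

The key design point this sketch reflects is that the heavy lifting happens in point-cloud space; the conversion to Gaussians is a cheap final step, which is what makes sub-second reconstruction plausible.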
Why it matters?
This work is important because it significantly lowers the barrier to entry for creating 3D human models. Previously, this required many images or specialized equipment. This method works with just two regular photos, even ones taken with a smartphone, making it much more accessible and faster: the system can create a model in under a fifth of a second (190 ms) on a high-end GPU.
Abstract
Reconstructing 3D human bodies from sparse views is an appealing topic, and it is crucial for broadening related applications. In this paper, we tackle a challenging but valuable task: reconstructing the human body from only two images, i.e., the front and back views, which can largely lower the barrier for users to create their own 3D digital humans. The main challenges lie in the difficulty of building 3D consistency and recovering missing information from the highly sparse input. We redesign a geometry reconstruction model based on foundation reconstruction models and train it on extensive human data so that it predicts consistent point clouds even when the input images have scarce overlap. Furthermore, an enhancement algorithm supplements the missing color information, yielding complete human point clouds with colors, which are directly transformed into 3D Gaussians for better rendering quality. Experiments show that our method can reconstruct an entire human in 190 ms on a single NVIDIA RTX 4090 from two images at a resolution of 1024×1024, demonstrating state-of-the-art performance on the THuman2.0 and cross-domain datasets. Additionally, our method can complete human reconstruction even with images captured by low-cost mobile devices, reducing the requirements for data collection. Demos and code are available at https://hustvl.github.io/Snap-Snap/.