GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Lorenza Prospero, Abdullah Hamdi, Joao F. Henriques, Christian Rupprecht

2024-09-09

Summary

This paper introduces GST, a method for reconstructing accurate 3D human body models from a single image by combining 3D Gaussian splatting with transformer networks.

What's the problem?

Reconstructing a realistic 3D model of a person from a single photo is difficult: the body has complex shape and pose, each image pixel can correspond to many 3D points, and clothing changes the appearance in ways a rigid body model cannot capture. Existing methods often handle these challenges with slow test-time optimization, expensive diffusion models, or supervision from 3D point data, and they still struggle to capture fine details accurately.

What's the solution?

The authors developed a method called Gaussian Splatting Transformers (GST), which represents the human body as a mixture of 3D Gaussians. They start from the vertices of a standardized human mesh (SMPL), which provide a sensible density and approximate initial positions for the Gaussians, and train a transformer to predict small adjustments to those positions together with the remaining Gaussian attributes and the SMPL parameters, all conditioned on the input image. Trained with only multi-view supervision, this produces detailed 3D models in a single fast forward pass, without test-time optimization or expensive diffusion models, and it also improves 3D pose estimation by fitting human models that account for clothing and other variations.
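To make the core idea concrete, here is a minimal sketch (not the authors' code) of a transformer that consumes image patch tokens together with one query token per SMPL vertex and predicts, for each vertex, a bounded position offset plus the remaining Gaussian attributes. All dimensions, layer counts, and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GSTSketch(nn.Module):
    def __init__(self, dim=256, num_vertices=6890, feat_dim=768):
        super().__init__()
        # One learnable query token per SMPL vertex (SMPL meshes have 6890 vertices).
        self.vertex_tokens = nn.Parameter(torch.zeros(num_vertices, dim))
        # Project backbone patch features (e.g. from a ViT) into the model width.
        self.patch_proj = nn.Linear(feat_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        # Per-Gaussian outputs: 3 position offset + 3 log-scale
        # + 4 rotation quaternion + 1 opacity + 3 RGB color = 14 numbers.
        self.head = nn.Linear(dim, 14)

    def forward(self, patch_features, smpl_vertices):
        # patch_features: (B, num_patches, feat_dim) image tokens from a backbone
        # smpl_vertices:  (B, num_vertices, 3) initial Gaussian centers
        B = patch_features.shape[0]
        img = self.patch_proj(patch_features)
        verts = self.vertex_tokens.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([img, verts], dim=1)
        out = self.transformer(tokens)[:, img.shape[1]:]  # keep vertex tokens only
        attrs = self.head(out)
        # Bound the offsets so the Gaussians stay near the SMPL surface.
        offsets = 0.05 * torch.tanh(attrs[..., :3])
        centers = smpl_vertices + offsets
        return centers, attrs[..., 3:]

model = GSTSketch()
feats = torch.randn(2, 196, 768)   # dummy image patch tokens
verts = torch.randn(2, 6890, 3)    # dummy SMPL vertex positions
centers, gaussian_attrs = model(feats, verts)
```

In the actual method the transformer also predicts adjustments to the SMPL parameters themselves; that head is omitted here for brevity.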

Why it matters?

This research is important because it has practical applications in areas like video games, virtual reality, and healthcare, where understanding human shape and movement is crucial. By making it fast and easy to create accurate 3D human models from a single image, GST lowers the barrier to building these kinds of experiences.

Abstract

Reconstructing realistic 3D human models from monocular images has significant applications in creative industries, human-computer interfaces, and healthcare. We base our work on 3D Gaussian Splatting (3DGS), a scene representation composed of a mixture of Gaussians. Predicting such mixtures for a human from a single input image is challenging, as it is a non-uniform density (with a many-to-one relationship with input pixels) with strict physical constraints. At the same time, it needs to be flexible to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) can provide an adequate density and approximate initial position for Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, as well as the other Gaussians' attributes and the SMPL parameters. We show empirically that this combination (using only multi-view supervision) can achieve fast inference of 3D human models from a single image without test-time optimization, expensive diffusion models, or 3D points supervision. We also show that it can improve 3D pose estimation by better fitting human models that account for clothes and other variations. The code is available on the project website https://abdullahamdi.com/gst/.
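Since the abstract stresses that training uses only multi-view supervision, the loss can be pictured as rendering the predicted Gaussians into every available camera and comparing against the captured images. The sketch below assumes a differentiable rasterizer `render_gaussians` as a stand-in for a 3DGS renderer; it is a hypothetical function, not a real API.

```python
import torch.nn.functional as F

def multiview_photometric_loss(centers, attrs, cameras, targets, render_gaussians):
    """Average L1 photometric loss over all training views.

    centers: (B, N, 3) Gaussian centers; attrs: (B, N, 11) other attributes.
    cameras: per-view camera parameters; targets: list of (B, 3, H, W) images.
    render_gaussians: hypothetical differentiable 3DGS rasterizer.
    """
    loss = 0.0
    for cam, target in zip(cameras, targets):
        rendered = render_gaussians(centers, attrs, cam)  # (B, 3, H, W)
        loss = loss + F.l1_loss(rendered, target)
    return loss / len(cameras)
```

Because the renderer is differentiable, gradients from this image-space loss flow back through the Gaussian attributes to the transformer, which is what lets the model train without any 3D point supervision.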