
IDOL: Instant Photorealistic 3D Human Creation from a Single Image

Yiyu Zhuang, Jiaxi Lv, Hao Wen, Qing Shuai, Ailing Zeng, Hao Zhu, Shifeng Chen, Yujiu Yang, Xun Cao, Wei Liu

2024-12-23


Summary

This paper introduces IDOL, a new method for creating realistic 3D human avatars from just a single image. It focuses on quickly and accurately generating full-body models that can be animated.

What's the problem?

Creating detailed, lifelike 3D models of humans from a single image is very challenging. People vary widely in pose and appearance, and there is not enough high-quality training data to cover that variety, so traditional methods often struggle to produce accurate, usable 3D representations.

What's the solution?

To solve this problem, the authors built a large dataset called HuGe100K, which contains 100,000 diverse, photorealistic image sets of humans; each set shows a person in a specific pose from 24 viewpoints. They then trained a model on this dataset that generates a 3D Gaussian representation of a person from a single image. The model disentangles pose, body shape, clothing geometry, and texture, so the resulting avatars are accurate and can be animated without extensive post-processing. It works quickly, producing high-quality results on a single GPU.
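To make the idea of a feed-forward "image in, 3D Gaussians out" predictor concrete, here is a minimal sketch. It is not the authors' architecture: the encoder, layer sizes, number of Gaussians, and parameter layout are illustrative assumptions (the real model is a scalable transformer trained on HuGe100K).

```python
# Minimal sketch (illustrative, not the paper's code): a feed-forward network
# that maps a single RGB image to a fixed set of 3D Gaussian parameters.
import torch
import torch.nn as nn

class ImageToGaussians(nn.Module):
    def __init__(self, num_gaussians=4096, feat_dim=256):
        super().__init__()
        # Simple CNN encoder standing in for the transformer backbone (assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Each Gaussian: 3 (position) + 3 (scale) + 4 (rotation quaternion)
        # + 1 (opacity) + 3 (RGB color) = 14 parameters per Gaussian.
        self.head = nn.Linear(feat_dim, num_gaussians * 14)
        self.num_gaussians = num_gaussians

    def forward(self, image):
        # image: (B, 3, H, W) -> gaussians: (B, num_gaussians, 14)
        feat = self.encoder(image).flatten(1)
        return self.head(feat).view(-1, self.num_gaussians, 14)

model = ImageToGaussians()
gaussians = model(torch.randn(1, 3, 256, 256))
print(gaussians.shape)  # torch.Size([1, 4096, 14])
```

Because the whole pipeline is a single forward pass rather than per-subject optimization, inference stays fast; in the actual system this is what allows 1K-resolution reconstruction on one GPU.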

Why it matters?

This research is important because it significantly advances the ability to create realistic 3D human models from minimal input. This technology can be used in video games, movies, virtual reality, and other fields where realistic human representations are needed. By making the process faster and more accessible, it opens up new possibilities for creators and developers.

Abstract

Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K diverse, photorealistic sets of human images. Each set contains 24-view frames in specific human poses, generated using a pose-controllable image-to-multi-view model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, body shape, clothing geometry, and texture. The estimated Gaussians can be animated without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the ability to efficiently reconstruct photorealistic humans at 1K resolution from a single input image using a single GPU instantly. Additionally, it seamlessly supports various applications, as well as shape and texture editing tasks.
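The abstract notes that the estimated Gaussians can be animated without post-processing. A standard way to drive point-based representations such as Gaussian centers with skeletal poses is linear blend skinning (LBS); the sketch below illustrates that general technique under placeholder assumptions (random points, two joints, identity transforms) and is not taken from the paper.

```python
# Minimal sketch of linear blend skinning (LBS) applied to point positions,
# e.g. Gaussian centers. Illustrative only; weights, joint count, and
# transforms are placeholder assumptions.
import numpy as np

def lbs(points, weights, joint_transforms):
    """Animate points with blended per-joint rigid transforms.

    points:            (N, 3) rest-pose positions
    weights:           (N, J) skinning weights, rows sum to 1
    joint_transforms:  (J, 4, 4) homogeneous transforms for the target pose
    """
    n = points.shape[0]
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)       # (N, 4)
    # Blend the joint transforms per point, then apply them.
    blended = np.einsum("nj,jab->nab", weights, joint_transforms)   # (N, 4, 4)
    posed = np.einsum("nab,nb->na", blended, homog)                 # (N, 4)
    return posed[:, :3]

# Example: 5 points, 2 joints, equal weights, identity pose (points unchanged).
pts = np.random.rand(5, 3)
w = np.full((5, 2), 0.5)
T = np.stack([np.eye(4), np.eye(4)])
print(np.allclose(lbs(pts, w, T), pts))  # True
```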