
One Shot, One Talk: Whole-body Talking Avatar from a Single Image

Jun Xiang, Yudong Guo, Leipeng Hu, Boyang Guo, Yancheng Yuan, Juyong Zhang

2024-12-05


Summary

This paper introduces a method for creating a realistic, animatable whole-body talking avatar from just a single image, while still offering precise control over gestures and facial expressions.

What's the problem?

Creating lifelike avatars that can talk and express emotions usually requires minutes of video, captured either from multiple viewpoints or as the subject slowly turns in front of a single camera. Recording this footage takes time, and existing methods often lack precise control over the avatar's movements, gestures, and facial expressions, making it difficult to produce truly realistic animations.

What's the solution?

To solve this problem, the researchers developed a pipeline that builds a whole-body talking avatar from a single image. They address two main challenges: modeling the body's complex dynamics and generalizing to new gestures and expressions. For generalization, they use a pose-guided image-to-video diffusion model to generate video frames that serve as pseudo-labels, that is, imperfect training examples. To cope with the inconsistencies and noise in these generated videos, they introduce a hybrid avatar representation that tightly couples 3D Gaussian splatting with a mesh, and they apply several regularizations so the avatar does not inherit the artifacts of the imperfect labels. A rough sketch of this two-stage idea follows.
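
The sketch below only illustrates the two stages described above: generating pseudo-label frames with a pose-guided image-to-video model, then fitting a 3DGS-mesh hybrid avatar to them. All function names, array shapes, and stub implementations here are assumptions for readability; they are not the authors' code or published API.

```python
# Minimal sketch, assuming hypothetical helper names (generate_pseudo_video,
# fit_hybrid_avatar) and toy data; the real pipeline uses a pose-guided
# image-to-video diffusion model and differentiable 3DGS rendering.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class HybridAvatar:
    """Toy stand-in for the tightly coupled 3DGS-mesh hybrid representation."""
    gaussian_params: np.ndarray   # per-Gaussian position/scale/opacity/color
    mesh_vertices: np.ndarray     # driving template mesh the Gaussians are tied to


def generate_pseudo_video(image: np.ndarray,
                          driving_poses: List[np.ndarray]) -> List[np.ndarray]:
    """Stage 1: a pose-guided image-to-video diffusion model would render one
    (imperfect) frame per driving pose. Here we just return noisy copies."""
    return [image + 0.05 * np.random.randn(*image.shape) for _ in driving_poses]


def fit_hybrid_avatar(image: np.ndarray,
                      pseudo_frames: List[np.ndarray],
                      driving_poses: List[np.ndarray],
                      steps: int = 100) -> HybridAvatar:
    """Stage 2: optimize the hybrid avatar against the single real image plus
    the noisy pseudo-labels, with regularizers to damp their inconsistencies."""
    avatar = HybridAvatar(gaussian_params=np.random.randn(10_000, 14),
                          mesh_vertices=np.random.randn(6_890, 3))
    for _ in range(steps):
        # In a real implementation each step would combine:
        #   - a photometric loss on the trusted real image,
        #   - a down-weighted loss on the noisy pseudo-frames,
        #   - regularizers keeping the Gaussians attached to the mesh surface.
        pass  # optimization omitted in this sketch
    return avatar


if __name__ == "__main__":
    source_image = np.zeros((512, 512, 3))
    poses = [np.zeros(63) for _ in range(8)]   # hypothetical pose vectors
    frames = generate_pseudo_video(source_image, poses)
    avatar = fit_hybrid_avatar(source_image, frames, poses)
    print(len(frames), avatar.gaussian_params.shape)
```

The key design point this sketch tries to convey is asymmetry of trust: the single real photograph is treated as reliable supervision, while the generated pseudo-frames only contribute weaker, heavily regularized signal.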

Why it matters?

This research is important because it simplifies the process of creating animated avatars, making it more accessible for various applications like video games, virtual reality, and online communication. By enabling the creation of realistic avatars from a single image, it opens up new possibilities for personalized digital interactions and enhances user experiences in digital environments.

Abstract

Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
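
To make the phrase "several key regularizations to mitigate inconsistencies" more concrete, here is one hedged way such a weighted objective could be assembled: a fully trusted loss on the single real image, a down-weighted loss on the noisy pseudo-frames, and regularization terms tying the Gaussians to the driving mesh. The term names and weights below are illustrative assumptions, not the paper's published loss.

```python
# Illustrative only: a generic weighted objective in the spirit described
# above; the specific terms and weights are assumptions.
import numpy as np


def total_loss(real_err: float,
               pseudo_errs: np.ndarray,
               reg_terms: dict,
               w_real: float = 1.0,
               w_pseudo: float = 0.1,
               w_reg: float = 0.01) -> float:
    """Trust the single real image most, down-weight the noisy pseudo-labels,
    and add regularizers (e.g. keeping Gaussians near the mesh surface)."""
    return (w_real * real_err
            + w_pseudo * float(np.mean(pseudo_errs))
            + w_reg * sum(reg_terms.values()))


# Example: photometric error on the reference image, perceptual errors on
# eight pseudo-frames, and two hypothetical regularizers.
print(total_loss(0.02,
                 np.array([0.3] * 8),
                 {"gaussian_to_mesh": 0.5, "temporal_smoothness": 0.2}))
```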