RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Jian Yang, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Donghao Luo, Chengjie Wang
2024-07-02
Summary
This paper introduces RealTalk, a system for generating realistic, audio-driven facial animations in real time. It focuses on making the generated lip movements match the spoken audio closely while preserving the unique features of each person's face.
What's the problem?
Creating realistic face animations that sync precisely with audio is a challenging task. Previous methods have made progress, but they often struggle with two main issues: preserving individual facial traits well enough for accurate lip synchronization, and generating high-quality facial images fast enough for real-time use. As a result, even as the technology improves, it remains impractical for everyday applications such as video calls or games.
What's the solution?
To address these problems, the authors built RealTalk around two main components: an audio-to-expression transformer and a high-fidelity expression-to-face renderer. The first component models how a person's face moves when they speak, applying cross-modal attention over enriched facial priors so that audio cues predict facial expressions accurately. The second component uses a lightweight facial identity alignment (FIA) module, which pairs a lip-shape control structure with a face texture reference structure, to render detailed facial images quickly without complicated feature-alignment machinery. Together, these components let RealTalk produce high-quality animations that closely match the audio in real time; a rough sketch of the first component follows.
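The paper is summarized here without code, so to make the first component concrete, below is a minimal PyTorch sketch of how an audio-to-expression transformer with cross-modal attention over facial priors might look. All class names, dimensions, and layer counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Hypothetical sketch of an audio-to-expression transformer.

    Per-frame audio features act as attention queries over enriched facial
    priors (identity features plus intra-personal variation features), and a
    small head regresses 3DMM-style expression coefficients.
    """

    def __init__(self, audio_dim=256, prior_dim=256, n_exp=64, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, prior_dim)
        # Cross-modal attention: audio queries, facial-prior keys/values.
        self.cross_attn = nn.MultiheadAttention(prior_dim, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(prior_dim, n_heads, batch_first=True),
            num_layers=2,
        )
        self.exp_head = nn.Linear(prior_dim, n_exp)

    def forward(self, audio_feats, identity_prior, variation_prior):
        # audio_feats:     (B, T, audio_dim)  per-frame audio embeddings
        # identity_prior:  (B, N, prior_dim)  identity-related facial priors
        # variation_prior: (B, M, prior_dim)  intra-personal variation priors
        priors = torch.cat([identity_prior, variation_prior], dim=1)
        q = self.audio_proj(audio_feats)
        fused, _ = self.cross_attn(q, priors, priors)  # align lips with audio
        fused = self.temporal(fused)                   # smooth over time
        return self.exp_head(fused)                    # (B, T, n_exp) coefficients
```

Letting the audio frames query the concatenated identity and variation priors is one straightforward way to realize the cross-modal alignment the paper describes: each predicted expression is conditioned both on what is being said and on how this particular face tends to move.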
Why it matters?
This research is important because it advances the field of audio-driven face generation, making it more practical for real-world applications. By improving lip synchronization and rendering quality while being efficient, RealTalk could enhance technologies like virtual reality, gaming, and online communication, allowing for more natural interactions between people and digital avatars.
Abstract
Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-quality facial renderings with real-time performance. In this paper, we propose a novel generalized audio-driven framework, RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to the needs of practical applications.
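As with the sketch above, the following is only an illustrative PyTorch reading of the second component's FIA module: one branch encodes a rendering of the predicted lip geometry (the lip-shape control structure), a second encodes a reference frame carrying the target identity's texture (the face texture reference structure), and a small decoder fuses them. Every name, dimension, and layer choice here is a hypothetical stand-in, not the paper's actual design.

```python
import torch
import torch.nn as nn

def _block(cin, cout):
    # Downsampling conv block; depth and widths are illustrative only.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)
    )

class FacialIdentityAlignment(nn.Module):
    """Hypothetical sketch of a lightweight FIA renderer.

    Fusing a lip-shape branch with an identity-texture branch lets the
    decoder keep fine identity detail without heavy feature-alignment
    machinery, which is the efficiency argument the paper makes.
    """

    def __init__(self, ch=64):
        super().__init__()
        self.lip_enc = nn.Sequential(_block(3, ch), _block(ch, ch * 2))  # lip-shape control
        self.tex_enc = nn.Sequential(_block(3, ch), _block(ch, ch * 2))  # texture reference
        self.fuse = nn.Conv2d(ch * 4, ch * 2, 1)
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, lip_shape_map, reference_frame):
        # lip_shape_map:   (B, 3, H, W) rendering of the predicted lip geometry
        # reference_frame: (B, 3, H, W) frame providing the identity's texture
        feats = torch.cat(
            [self.lip_enc(lip_shape_map), self.tex_enc(reference_frame)], dim=1
        )
        return self.decode(self.fuse(feats))  # (B, 3, H, W) synthesized face
```

The design intuition, under these assumptions, is that concatenating and fusing the two feature maps with a 1x1 convolution is far cheaper than the sophisticated feature-alignment modules the abstract contrasts against, which is what would make real-time rendering plausible.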