DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Hanbo Cheng, Limin Lin, Chenyu Liu, Pengcheng Xia, Pengfei Hu, Jiefeng Ma, Jun Du, Jia Pan

2024-10-21

Summary

This paper introduces DAWN, a new framework for generating realistic talking head videos from a single portrait and a speech audio clip using a non-autoregressive diffusion model.

What's the problem?

Creating talking head videos that look natural and lifelike is challenging. Most existing methods rely on autoregressive strategies that generate video frames one at a time, which slows generation and lets errors accumulate across frames, making it hard to produce high-quality videos efficiently.

What's the solution?

To solve these issues, the authors developed DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), which generates an entire video sequence all at once instead of frame by frame. DAWN has two main components: one that generates holistic facial dynamics in a latent motion space from the audio, and another that generates matching head poses and blinks from the same audio. This design speeds up generation while keeping lip movements and expressions accurately synchronized with the speech.
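
To make the non-autoregressive idea concrete, here is a minimal sketch in PyTorch of how such a two-part pipeline could be wired together. The class names (PoseBlinkGenerator, MotionDenoiser), the network shapes, and the toy denoising update are all illustrative assumptions, not DAWN's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical modules -- a sketch of the two-component, non-autoregressive
# design described above, not DAWN's real architecture.

class PoseBlinkGenerator(nn.Module):
    """Predicts a head-pose + blink signal for every frame from audio."""
    def __init__(self, audio_dim=128, pose_dim=7):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, 64, batch_first=True)
        self.head = nn.Linear(64, pose_dim)

    def forward(self, audio):                 # audio: (B, T, audio_dim)
        h, _ = self.rnn(audio)
        return self.head(h)                   # (B, T, pose_dim)


class MotionDenoiser(nn.Module):
    """One denoising pass over the ENTIRE latent motion sequence at once."""
    def __init__(self, motion_dim=256, cond_dim=128 + 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, 512),
            nn.GELU(),
            nn.Linear(512, motion_dim),
        )

    def forward(self, x_t, cond, t):          # x_t: (B, T, motion_dim)
        # Broadcast the diffusion timestep to every frame.
        t_emb = t.view(-1, 1, 1).expand(-1, x_t.size(1), 1).float()
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))


@torch.no_grad()
def generate_video_latents(audio, num_steps=50, motion_dim=256):
    """All T frames are denoised jointly -- no frame-by-frame loop."""
    B, T, _ = audio.shape
    pose = PoseBlinkGenerator()(audio)        # component 2: pose/blink track
    cond = torch.cat([audio, pose], dim=-1)   # per-frame conditioning
    denoiser = MotionDenoiser()               # component 1: facial dynamics
    x = torch.randn(B, T, motion_dim)         # noise for ALL frames at once
    for step in reversed(range(num_steps)):
        t = torch.full((B,), step)
        eps = denoiser(x, cond, t)            # joint pass over the sequence
        x = x - eps / num_steps               # toy update, not a real sampler
    return x                                  # latent motion codes per frame


latents = generate_video_latents(torch.randn(2, 100, 128))
print(latents.shape)  # torch.Size([2, 100, 256])
```

The key point is that the denoiser sees the whole T-frame sequence in every pass, conditioned per frame on the audio and the generated pose/blink track, instead of emitting one frame at a time as an autoregressive model would.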

Why it matters?

This research is important because it improves the technology behind creating talking head videos, which can be used in various applications like virtual assistants, video games, and filmmaking. By enabling faster and more realistic video generation, DAWN could enhance how we interact with digital characters and improve storytelling in media.

Abstract

Talking head generation aims to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.