
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Taekyung Ki, Dongchan Min, Gyoungsu Chae

2024-12-03


Summary

This paper introduces FLOAT, a new method for generating audio-driven talking portrait videos that uses flow matching in a learned motion latent space to create realistic and expressive animations.

What's the problem?

Generating talking portrait videos that look natural and stay consistent over time is difficult. Existing diffusion-based methods rely on iterative sampling, so producing a video is slow, and keeping motion consistent from frame to frame is hard. This makes it difficult to produce high-quality animations quickly.

What's the solution?

FLOAT addresses these problems with an approach called generative motion latent flow matching. Instead of generating pixels directly, FLOAT models motion in a learned latent space, which makes it easier to produce smooth, temporally consistent movement. A transformer-based model predicts the vector field that drives the portrait's motion, conditioned frame by frame on the audio input, and it can also enhance emotional expressions derived from the speech. This design allows much faster video generation while keeping visual quality and motion accuracy high.
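To make the core idea concrete, here is a minimal, hypothetical sketch of flow matching in a motion latent space with a transformer vector field predictor and frame-wise audio conditioning. All names, dimensions, and architectural choices (MotionVectorField, motion_dim, etc.) are illustrative assumptions for explanation, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class MotionVectorField(nn.Module):
    """Transformer that predicts the flow-matching vector field for a sequence
    of motion latents, conditioned per frame on audio features (hypothetical)."""
    def __init__(self, motion_dim=128, audio_dim=64, hidden=256, layers=4):
        super().__init__()
        # +1 input channel for the scalar flow time t
        self.in_proj = nn.Linear(motion_dim + audio_dim + 1, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, x_t, audio, t):
        # x_t:   (B, T, motion_dim) noisy motion latents at flow time t
        # audio: (B, T, audio_dim)  frame-wise audio features (the conditioning)
        # t:     (B,)               flow time in [0, 1]
        t_feat = t[:, None, None].expand(-1, x_t.size(1), 1)
        h = self.in_proj(torch.cat([x_t, audio, t_feat], dim=-1))
        return self.out_proj(self.encoder(h))

def flow_matching_loss(model, x1, audio):
    """Conditional flow matching: interpolate between noise x0 and data x1,
    then regress the model onto the straight-line velocity (x1 - x0)."""
    x0 = torch.randn_like(x1)                      # Gaussian prior sample
    t = torch.rand(x1.size(0), device=x1.device)   # random flow time per sample
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    target_velocity = x1 - x0
    pred_velocity = model(x_t, audio, t)
    return ((pred_velocity - target_velocity) ** 2).mean()

# Toy usage with random tensors standing in for real motion latents and audio.
model = MotionVectorField()
x1 = torch.randn(2, 50, 128)     # "ground-truth" motion latents, 50 frames
audio = torch.randn(2, 50, 64)   # frame-wise audio features
loss = flow_matching_loss(model, x1, audio)
loss.backward()
```

The key point of the sketch is that the model learns velocities in a compact motion space rather than pixel space, which is what makes both temporal consistency and fast sampling easier to achieve.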

Why it matters?

This research is important because it significantly improves how AI can create animated portraits that respond to audio, making them more lifelike and engaging. This technology can be applied in various fields, such as entertainment, virtual reality, and education, where realistic character animations can enhance user experience.

Abstract

With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
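At inference time, a flow matching model like the one sketched above can be sampled by integrating the learned vector field from noise toward motion latents, for example with a few Euler steps. The snippet below reuses the hypothetical MotionVectorField from the earlier sketch and is an assumption about how such sampling could look, not the paper's actual procedure.

```python
import torch

@torch.no_grad()
def sample_motion(model, audio, steps=10, motion_dim=128):
    """Integrate the learned vector field from t=0 (noise) to t=1 (motion
    latents) with simple Euler steps; fewer steps means faster sampling."""
    B, T, _ = audio.shape
    x = torch.randn(B, T, motion_dim, device=audio.device)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt, device=audio.device)
        x = x + dt * model(x, audio, t)   # Euler update along the flow
    return x  # motion latents, later decoded into video frames

# Example: generate motion latents for 50 frames of audio features.
motion = sample_motion(model, torch.randn(1, 50, 64), steps=10)
```

Because the whole trajectory lives in the low-dimensional motion latent space and needs only a handful of integration steps, this kind of sampler is much cheaper than iterating a pixel-space diffusion model, which is the efficiency advantage the abstract highlights.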