MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan
2024-12-06

Summary
This paper introduces MEMO, a method for generating realistic talking videos whose lip movements and facial expressions are synchronized with the input audio, so the results look natural and consistent.
What's the problem?
Generating talking videos that match an audio track is challenging. Existing methods struggle to keep lip movements in sync with the audio, to maintain the person's identity consistently throughout the video, and to produce natural facial expressions that match the emotions conveyed by the audio.
What's the solution?
The authors developed MEMO, which addresses these problems with two main components: a memory-guided temporal module, which keeps track of information from earlier frames to ensure smooth motion and consistent identity, and an emotion-aware audio module, which improves how audio and video interact and detects emotions from the audio to adjust facial expressions accordingly. Together, these components allow MEMO to generate high-quality talking videos that outperform previous methods.
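To make the memory-guided idea more concrete, here is a minimal PyTorch sketch (not the authors' code) of how past frames could be compressed into memory states and read back through linear attention. The module name, tensor shapes, and the positive feature map are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryGuidedLinearAttention(nn.Module):
    """Sketch of a memory-guided temporal layer (illustrative, not MEMO's code).

    Past frame features are folded into a running memory state (a key-value
    summary plus a normalizer), so attending to a long past context has
    constant cost regardless of how many frames came before.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, frame_feats, memory_kv, memory_norm):
        # frame_feats: (batch, tokens, dim) features of the current clip
        # memory_kv:   (batch, dim, dim)    accumulated k^T v of past frames
        # memory_norm: (batch, dim)         accumulated keys for normalization
        q = F.elu(self.to_q(frame_feats)) + 1.0   # positive feature map
        k = F.elu(self.to_k(frame_feats)) + 1.0
        v = self.to_v(frame_feats)

        # Read: current queries attend to the compressed past context.
        out = torch.einsum("btd,bde->bte", q, memory_kv)
        denom = torch.einsum("btd,bd->bt", q, memory_norm).clamp(min=1e-6)
        out = out / denom.unsqueeze(-1)

        # Write: fold the current clip into the running memory state.
        new_kv = memory_kv + torch.einsum("btd,bte->bde", k, v)
        new_norm = memory_norm + k.sum(dim=1)
        return out, new_kv, new_norm
```

In a generation pipeline, the returned memory tensors would be carried from one generated clip to the next, which is how a longer past context can guide temporal modeling without recomputing attention over every previous frame.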
Why it matters?
This research is important because it enhances the technology used in video generation, making it more effective for applications like animation, virtual reality, and content creation. By improving how videos are generated from images and audio, MEMO can help create more engaging and lifelike digital content.
Abstract
Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
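As an illustration of the emotion adaptive layer norm mentioned in the abstract, here is a hedged PyTorch sketch: an emotion embedding detected from the audio produces a per-channel scale and shift that modulate the normalized visual features. The class name, dimensions, and exact modulation form are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class EmotionAdaptiveLayerNorm(nn.Module):
    """Sketch of emotion-adaptive layer norm (AdaLN-style modulation).

    An emotion embedding predicted from the audio conditions a scale and
    shift applied to layer-normalized features, nudging the generated
    facial expression toward the detected emotion.
    """

    def __init__(self, dim: int, emotion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * dim)

    def forward(self, feats, emotion_emb):
        # feats:       (batch, tokens, dim) visual features inside the denoiser
        # emotion_emb: (batch, emotion_dim) emotion vector derived from audio
        scale, shift = self.to_scale_shift(emotion_emb).chunk(2, dim=-1)
        return self.norm(feats) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```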