HunyuanVideo-Avatar's character image injection module replaces the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference and yielding dynamic motion with strong character consistency. The Audio Emotion Module (AEM) extracts emotional cues from an emotion reference image and transfers them to the generated video, enabling fine-grained, accurate control of emotional style. The Face-Aware Audio Adapter (FAA) isolates each audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention in multi-character scenarios.
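
The FAA mechanism is described here only at a high level, so the following PyTorch sketch is a rough illustration of what latent-level, face-masked audio cross-attention might look like. The class and argument names (FaceAwareAudioCrossAttention, latent_dim, audio_dim, face_mask) are hypothetical, not Tencent's actual implementation.

```python
# A minimal sketch of face-masked audio cross-attention in the spirit of
# the Face-Aware Audio Adapter (FAA). All names are illustrative guesses,
# not HunyuanVideo-Avatar's actual code.
import torch
import torch.nn as nn

class FaceAwareAudioCrossAttention(nn.Module):
    """Injects one character's audio features into the video latents via
    cross-attention, gated by a latent-level face mask so only that
    character's face region is updated."""

    def __init__(self, latent_dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, kdim=audio_dim, vdim=audio_dim,
            num_heads=num_heads, batch_first=True)

    def forward(self, latents: torch.Tensor, audio_feats: torch.Tensor,
                face_mask: torch.Tensor) -> torch.Tensor:
        # latents:     (B, N, C) flattened video latent tokens
        # audio_feats: (B, T, A) audio features for one character
        # face_mask:   (B, N, 1) 1.0 where a latent token falls inside that
        #              character's face region, 0.0 elsewhere
        attended, _ = self.attn(query=latents, key=audio_feats,
                                value=audio_feats)
        # Residual update gated by the mask: tokens outside the face region
        # are left untouched, so the background and other characters are
        # unaffected by this character's audio stream.
        return latents + face_mask * attended
```

Gating the residual with the mask is what would make the audio injection independent per character: each character's audio can only modify the latent tokens inside its own face region.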

HunyuanVideo-Avatar surpasses state-of-the-art methods on benchmark datasets and a newly proposed in-the-wild dataset, generating realistic avatars in dynamic, immersive scenarios. Its ability to produce high-fidelity, audio-driven animation for multiple characters makes it a valuable tool for video production, advertising, and social media, while its emotion control and multi-character support suit applications in entertainment, education, and healthcare.

Key Features

Multimodal diffusion transformer (MM-DiT)-based model
Generates dynamic, emotion-controllable, and multi-character dialogue videos
Character image injection module for dynamic motion and strong character consistency
Audio Emotion Module (AEM) for fine-grained and accurate emotion style control
Face-Aware Audio Adapter (FAA) for independent audio injection via cross-attention
Enables multi-character audio-driven animation (see the sketch after this list)
Generates realistic avatars in dynamic, immersive scenarios
Suitable for various applications, including video production, advertising, and social media
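
As a usage illustration of the multi-character path, the hypothetical driver below reuses the FaceAwareAudioCrossAttention sketch from earlier, running it once per character with that character's own audio features and face mask, so the audio streams stay independent. The shapes and masks are toy values chosen for the example.

```python
import torch

# Hypothetical multi-character driver; depends on the
# FaceAwareAudioCrossAttention sketch defined above.
B, N, C, A = 1, 1024, 128, 64      # batch, latent tokens, latent dim, audio dim
faa = FaceAwareAudioCrossAttention(latent_dim=C, audio_dim=A)

latents = torch.randn(B, N, C)     # flattened video latent tokens
char_a_mask = torch.zeros(B, N, 1)
char_b_mask = torch.zeros(B, N, 1)
char_a_mask[:, : N // 2] = 1.0     # toy masks: character A "owns" the first
char_b_mask[:, N // 2 :] = 1.0     # half of the tokens, character B the rest

characters = [
    (torch.randn(B, 50, A), char_a_mask),  # (audio features, face mask)
    (torch.randn(B, 50, A), char_b_mask),
]

for audio_feats, face_mask in characters:
    # Each pass updates only the tokens inside that character's mask,
    # so one speaker's audio never bleeds into another's face region.
    latents = faa(latents, audio_feats, face_mask)
```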
