HunyuanVideo-Avatar's character image injection module replaces the conventional addition-based character conditioning scheme, eliminating the inherent mismatch between training and inference conditions and thereby ensuring both dynamic motion and strong character consistency. The Audio Emotion Module (AEM) extracts emotional cues from an emotion reference image and transfers them to the target generated video, enabling fine-grained and accurate emotion style control. The Face-Aware Audio Adapter (FAA) isolates each audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention in multi-character scenarios; a sketch of this mechanism follows below.
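To make the FAA idea concrete, the following is a minimal PyTorch sketch of masked cross-attention audio injection. It is an illustration under stated assumptions, not the released HunyuanVideo-Avatar implementation: the class name `FaceAwareAudioInjection`, the tensor names (`latent`, `audio_tokens`, `face_mask`), and all shapes are hypothetical.

```python
# Minimal sketch of FAA-style masked cross-attention, assuming flattened
# video latent tokens and a per-character binary face mask. Names and
# shapes are illustrative assumptions, not the paper's actual API.
import torch
import torch.nn as nn


class FaceAwareAudioInjection(nn.Module):
    """Inject audio features only into the latent tokens covered by a
    character's face mask, so each character in a multi-character scene
    can follow its own audio stream independently."""

    def __init__(self, latent_dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, num_heads, kdim=audio_dim, vdim=audio_dim,
            batch_first=True,
        )

    def forward(
        self,
        latent: torch.Tensor,        # (B, N, latent_dim) flattened latent tokens
        audio_tokens: torch.Tensor,  # (B, M, audio_dim) one character's audio features
        face_mask: torch.Tensor,     # (B, N) binary mask, 1 on this character's face
    ) -> torch.Tensor:
        # Latent tokens query the audio tokens via cross-attention.
        update, _ = self.cross_attn(latent, audio_tokens, audio_tokens)
        # Gate the update with the face mask so the audio only drives the
        # masked character; repeat per character for multi-character scenes.
        return latent + face_mask.unsqueeze(-1) * update


if __name__ == "__main__":
    module = FaceAwareAudioInjection(latent_dim=128, audio_dim=768)
    latent = torch.randn(2, 1024, 128)
    audio = torch.randn(2, 50, 768)
    mask = (torch.rand(2, 1024) > 0.9).float()
    print(module(latent, audio, mask).shape)  # torch.Size([2, 1024, 128])
```

Gating the attention output with the face mask (rather than masking the attention weights themselves) leaves non-face latent tokens untouched, which is what allows several characters, each with their own mask and audio stream, to be driven independently in a single pass per character.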
HunyuanVideo-Avatar surpasses state-of-the-art methods on benchmark datasets and on a newly proposed in-the-wild dataset, generating realistic avatars in dynamic, immersive scenarios. Its ability to produce high-fidelity, audio-driven human animation for multiple characters, combined with fine-grained emotion control, makes it a valuable tool for applications such as video production, advertising, and social media, and well suited to industries such as entertainment, education, and healthcare.