A key innovation in FantasyTalking is its facial-focused cross-attention module, which replaces the traditional reference network to better preserve the subject's identity throughout the animation. This module helps ensure that the distinctive facial features of the original portrait are retained even as the model generates expressive, dynamic motion. FantasyTalking also integrates a motion intensity modulation network that lets users explicitly control the strength of facial expressions and body movements, so the generated videos are not only synchronized with speech but also rich in emotional nuance and natural movement, making the animated avatars more realistic and engaging.
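To make these two mechanisms concrete, here is a minimal PyTorch sketch of how a facial-focused cross-attention block and an intensity-conditioning network might be wired together. The module names, tensor shapes, and the FiLM-style scale-and-shift formulation are illustrative assumptions for this sketch, not FantasyTalking's actual implementation.

```python
import torch
import torch.nn as nn

class FacialFocusedCrossAttention(nn.Module):
    """Sketch: inject facial identity features into video latent tokens
    via cross-attention (hypothetical shapes and naming)."""
    def __init__(self, latent_dim: int, face_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=num_heads,
            kdim=face_dim, vdim=face_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, face_tokens):
        # latents: (B, N, latent_dim) video latent tokens
        # face_tokens: (B, M, face_dim) features extracted from the
        # face region of the reference portrait
        attended, _ = self.attn(self.norm(latents), face_tokens, face_tokens)
        return latents + attended  # residual injection of identity cues


class MotionIntensityModulation(nn.Module):
    """Sketch: embed a user-supplied scalar intensity in [0, 1] and apply
    a FiLM-style scale/shift to the latent stream, steering how strongly
    expressions and body motion come through (an assumed formulation)."""
    def __init__(self, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * latent_dim))

    def forward(self, latents, intensity):
        # intensity: (B, 1) motion strength chosen by the user
        scale, shift = self.mlp(intensity).chunk(2, dim=-1)
        return latents * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Toy usage with random tensors
B, N, M, D, F = 2, 64, 16, 320, 512
latents = torch.randn(B, N, D)
face_tokens = torch.randn(B, M, F)
intensity = torch.full((B, 1), 0.7)  # moderately expressive motion

latents = FacialFocusedCrossAttention(D, F)(latents, face_tokens)
latents = MotionIntensityModulation(D)(latents, intensity)
print(latents.shape)  # torch.Size([2, 64, 320])
```

The key idea the sketch illustrates is separation of concerns: identity is preserved by repeatedly attending to face-region features, while expressiveness is a separate, continuously adjustable conditioning signal rather than a fixed property of the model.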
FantasyTalking stands out for its versatility and accessibility. It supports a wide range of avatar styles, from photorealistic to cartoon, and can animate human characters and animals in various poses and framings, including close-up, half-body, and full-body. The open-source release includes inference code and model weights, making the method readily available for research, creative projects, and integration into broader video generation workflows. Extensive evaluations show that FantasyTalking outperforms previous methods in video quality, identity preservation, motion diversity, and lip synchronization, positioning it as a leading solution for realistic, controllable talking portrait generation.