SkyReels-Audio introduces a facial mask loss and an audio-guided classifier-free guidance mechanism to enhance local facial coherence. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. The framework also supports video editing, allowing for lip movement alignment given reference videos and audio clips. This makes it a valuable tool for applications such as video production, advertising, and social media.
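
The sliding-window idea described above can be illustrated with a minimal sketch. It is not the SkyReels-Audio implementation: the function names, the window and stride values, and the plain averaging of overlapping frames are all assumptions chosen for illustration. Each window holds per-frame latent vectors, and frames covered by more than one window are averaged so that neighbouring segments agree where they overlap.

```python
from typing import List

Frame = List[float]  # one latent vector per video frame (toy representation)


def window_starts(total_frames: int, window: int, stride: int) -> List[int]:
    """Return start indices so windows of length `window` cover every frame."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if total_frames <= window:
        return [0]
    starts = list(range(0, total_frames - window, stride))
    starts.append(total_frames - window)  # last window ends exactly at the final frame
    return starts


def fuse_windows(windows: List[List[Frame]], starts: List[int],
                 total_frames: int) -> List[Frame]:
    """Average the latents of all windows that cover each frame (assumed fusion rule)."""
    dim = len(windows[0][0])
    acc = [[0.0] * dim for _ in range(total_frames)]
    count = [0] * total_frames
    for start, win in zip(starts, windows):
        for offset, frame in enumerate(win):
            t = start + offset
            count[t] += 1
            for d in range(dim):
                acc[t][d] += frame[d]
    return [[v / count[t] for v in acc[t]] for t in range(total_frames)]


# Toy usage: 10 frames, windows of 4 frames, stride 2.
starts = window_starts(10, 4, 2)
# Window i is filled with the constant value i, standing in for denoised latents.
windows = [[[float(i)] for _ in range(4)] for i in range(len(starts))]
fused = fuse_windows(windows, starts, 10)
```

In a diffusion pipeline, a fusion step of this kind would typically run at every denoising step rather than once at the end, so that the windows stay consistent throughout sampling; that placement is also an assumption here.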


In benchmark evaluations, SkyReels-Audio is reported to achieve superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions. The framework accepts reference images of different objectives, sizes, and styles, and is reported to produce naturally consistent video results. This makes it a promising technology for industries such as entertainment, education, and healthcare, and a useful asset for content creators and producers who need realistic, coherent talking portraits.

Key Features

Unified framework for talking portrait video synthesis
Infinite-length generation and editing
Diverse and controllable conditioning through multimodal inputs
Hybrid curriculum learning strategy for audio-facial motion alignment
Facial mask loss and audio-guided classifier-free guidance
Sliding-window denoising approach for temporal consistency
Video editing capabilities for lip movement alignment
Support for reference images of different objectives, sizes, and styles
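
Two of the features listed above, the facial mask loss and audio-guided classifier-free guidance, can be sketched in a few lines. This is a hedged illustration only. The guidance function uses the standard classifier-free guidance formula, with the audio-conditioned prediction playing the role of the conditional branch. The loss is a weighted mean squared error that counts face-region elements more heavily. The names `guided_prediction`, `masked_mse`, `scale`, and `face_weight` are hypothetical, and the exact formulations used by SkyReels-Audio may differ.

```python
from typing import List


def guided_prediction(uncond: List[float], audio_cond: List[float],
                      scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(uncond) != len(audio_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, audio_cond)]


def masked_mse(pred: List[float], target: List[float], face_mask: List[float],
               face_weight: float = 3.0) -> float:
    """Weighted MSE: elements inside the face mask (mask = 1) count `face_weight` times."""
    if not (len(pred) == len(target) == len(face_mask)):
        raise ValueError("inputs must have the same length")
    weights = [1.0 + (face_weight - 1.0) * m for m in face_mask]
    weighted_sq = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return weighted_sq / sum(weights)


# Toy usage with flattened predictions (real models operate on latent tensors).
eps = guided_prediction([0.0, 1.0], [1.0, 3.0], scale=2.0)
loss = masked_mse([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

With `scale` set to 1.0 the guided output equals the audio-conditioned prediction. Larger values push the result further from the unconditional prediction, which generally strengthens adherence to the audio at some cost to diversity.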
