SkyReels-Audio introduces a facial mask loss and an audio-guided classifier-free guidance mechanism to improve local facial coherence. To handle long sequences, a sliding-window denoising approach fuses latent representations across temporal segments, which helps maintain visual fidelity and temporal consistency over extended durations and across diverse identities. The framework also supports video editing: given a reference video and an audio clip, it aligns the lip movements in the video to the audio. These capabilities suit applications such as video production, advertising, and social media.
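To make the sliding-window idea concrete, here is a minimal sketch of one fusion step over a sequence of per-frame latents. It is an illustration under stated assumptions, not the SkyReels-Audio implementation: the text above does not specify the fusion rule, so the triangular blending weights, the function names `window_spans` and `fuse_step`, the stand-in `denoise_window` callable, and the window and stride sizes are all hypothetical.

```python
from typing import Callable, List, Tuple

Latent = List[float]  # toy stand-in for one frame's latent vector


def window_spans(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) spans covering all frames with overlapping windows."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    if num_frames <= window:
        return [(0, num_frames)]
    spans = []
    start = 0
    while start + window < num_frames:
        spans.append((start, start + window))
        start += stride
    # Right-align the final window so the last frames are always covered.
    spans.append((num_frames - window, num_frames))
    return spans


def fuse_step(
    latents: List[Latent],
    denoise_window: Callable[[List[Latent]], List[Latent]],
    window: int,
    stride: int,
) -> List[Latent]:
    """Denoise each window independently, then blend overlapping frames.

    Assumption: overlaps are fused by a weighted average whose weights
    peak at the window center (a triangular ramp). The actual rule used
    by the framework is not given in the text.
    """
    if not latents:
        return []
    num_frames, dim = len(latents), len(latents[0])
    acc = [[0.0] * dim for _ in range(num_frames)]
    weight_sum = [0.0] * num_frames

    for start, end in window_spans(num_frames, window, stride):
        out = denoise_window(latents[start:end])  # hypothetical per-window model call
        n = end - start
        for i in range(n):
            w = float(min(i + 1, n - i))  # triangular weight, always > 0
            weight_sum[start + i] += w
            for d in range(dim):
                acc[start + i][d] += w * out[i][d]

    # Normalize by the accumulated weights to get the fused latent per frame.
    return [[v / weight_sum[t] for v in acc[t]] for t in range(num_frames)]


if __name__ == "__main__":
    frames = [[float(t), 1.0] for t in range(9)]
    # Toy "denoiser" that simply halves every value in the window.
    halve = lambda win: [[0.5 * v for v in frame] for frame in win]
    fused = fuse_step(frames, halve, window=4, stride=2)
    print(len(fused), fused[0])
```

In a real diffusion pipeline a step like this would presumably run once per denoising iteration, so that neighboring windows agree on their shared frames before the next iteration begins. Down-weighting window edges is one common way to avoid visible seams at segment boundaries.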
SkyReels-Audio has been evaluated on comprehensive benchmarks, where it is reported to achieve superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions. The framework accepts reference images of different subjects, sizes, and styles, and is claimed to produce naturally consistent video results. This makes it a promising technology for industries such as entertainment, education, and healthcare, and its ability to generate realistic, coherent talking portraits makes it useful to content creators and producers.