Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen
2025-12-10
Summary
This paper introduces a new method called LivingSwap for swapping faces in videos, aiming to make the results more realistic while reducing manual work for filmmakers and visual effects artists.
What's the problem?
Swapping faces in videos, especially long or complex ones, is hard to do well with current tools. Existing methods often struggle to keep the swapped face consistent across the entire video, and the result can look unnatural or 'glitchy'. It is difficult to make the new face match the lighting, expressions, and movements of the original face over time.
What's the solution?
LivingSwap tackles this by using keyframes – selected frames that show the target identity – to guide the face-swapping process. Think of it as giving the model examples of what the swapped face should look like at certain points in time. It also uses the entire source video as a reference, helping it keep lighting, expression, and motion consistent. To train the model, the researchers built a new paired dataset of face-swapped videos, ensuring there was enough data for the system to learn reference-guided swapping effectively.
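To make the two guidance signals more concrete, here is a minimal, hypothetical sketch in PyTorch-style Python of how keyframe conditioning and whole-video reference guidance could be fed to a video diffusion backbone. The module names, tensor shapes, and the channel-concatenation scheme are illustrative assumptions for explanation only, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code) of combining keyframe
# conditioning with video-reference guidance for a diffusion-style editor.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a video diffusion backbone (e.g. a 3D U-Net or DiT)."""
    def __init__(self, in_ch: int, out_ch: int = 4):
        super().__init__()
        self.net = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def build_conditioning(noisy_latents, source_latents, keyframe_latents, keyframe_idx):
    """Assemble the conditioning input.

    noisy_latents    : (B, C, T, H, W) latents being denoised
    source_latents   : (B, C, T, H, W) encoded source video (reference guidance)
    keyframe_latents : (B, C, K, H, W) encoded face-swapped keyframes
    keyframe_idx     : list of K frame indices where keyframes are anchored
    """
    cond = noisy_latents.clone()
    # Keyframe conditioning: pin the target identity at selected frames.
    for k, t in enumerate(keyframe_idx):
        cond[:, :, t] = keyframe_latents[:, :, k]
    # Video-reference guidance: expose the full source video to the model
    # so lighting, expression, and motion can be matched frame by frame.
    return torch.cat([cond, source_latents], dim=1)

if __name__ == "__main__":
    B, C, T, H, W = 1, 4, 16, 32, 32
    noisy = torch.randn(B, C, T, H, W)
    source = torch.randn(B, C, T, H, W)
    keyframes = torch.randn(B, C, 2, H, W)
    denoiser = ToyDenoiser(in_ch=2 * C)
    x = build_conditioning(noisy, source, keyframes, keyframe_idx=[0, 8])
    print(denoiser(x).shape)  # torch.Size([1, 4, 16, 32, 32])
```

The design idea this sketch tries to capture is that keyframes pin the identity at a few fixed time steps, while the source video is visible at every frame, so the model can copy lighting and motion locally and stitch the result into a temporally stable sequence.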
Why it matters?
This research is important because it significantly improves the quality and efficiency of video face swapping. By reducing the amount of manual editing needed, it can save a lot of time and money in film and video production, and open up possibilities for more realistic and seamless visual effects.
Abstract
Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap
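The "reverse the data pairs" idea from the abstract can be illustrated with a short, hypothetical sketch: a real clip is first swapped to another identity with an off-the-shelf tool, and the resulting pair is then used in the reverse direction so that the real clip serves as reliable ground truth. The TrainingPair fields and the swap_identity() helper below are illustrative placeholders, not the authors' Face2Face pipeline.

```python
# Hypothetical sketch of reversed-pair supervision for reference-guided training.
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingPair:
    source_video: str      # input clip the model must edit
    keyframes: List[str]   # frames carrying the identity to inject
    ground_truth: str      # clip the model should reconstruct

def swap_identity(real_clip: str, identity: str) -> str:
    """Placeholder for an off-the-shelf face-swap pass producing a synthetic clip."""
    return f"{real_clip}__swapped_to_{identity}.mp4"

def build_reversed_pair(real_clip: str, other_identity: str) -> TrainingPair:
    # Training in the forward direction would supervise against a synthetic clip;
    # reversing the pair makes the *real* clip the reconstruction target instead.
    synthetic_clip = swap_identity(real_clip, other_identity)
    return TrainingPair(
        source_video=synthetic_clip,          # synthetic input video
        keyframes=[f"{real_clip}#frame0"],    # keyframes show the real identity
        ground_truth=real_clip,               # real video = reliable supervision
    )

if __name__ == "__main__":
    print(build_reversed_pair("clip_001.mp4", "actor_B"))
```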