End-to-End Video Character Replacement without Structural Guidance
Zhengbo Xu, Jie Ma, Ziheng Wang, Zhan Peng, Jun Liang, Jing Li
2026-01-14
Summary
This paper introduces a new method, called MoCha, for replacing a character in a video with a different identity provided by the user, without needing complicated per-frame annotations or perfectly paired video data.
What's the problem?
Currently, replacing a character in a video is really hard because most methods need a precise outline (a segmentation mask) of the person in *every single frame*, along with explicit structural guidance such as a skeleton or depth map. This breaks down when people are partially hidden, interacting with objects, in unusual poses, or in bad lighting, leading to glitchy, flickering results. On top of that, collecting all that frame-by-frame data is incredibly time-consuming and difficult.
What's the solution?
MoCha solves this by requiring only a single mask: you outline the character in just one arbitrary frame of the video. A condition-aware positional encoding helps the model combine the video with the new identity, and a reinforcement-learning stage (learning through trial and error) then refines the result so the replaced character looks natural. To overcome the shortage of paired training data, the researchers also built three new datasets: one rendered with the Unreal Engine 5 game engine, one synthesized with portrait-animation techniques, and one that augments existing video-mask pairs. Together these datasets train the system to be more accurate.
Why it matters?
This research matters because it makes character replacement in videos far more accessible and robust. By removing the need for detailed frame-by-frame annotations, it opens the door to more creative applications and reduces the effort required to edit videos. The code will be released, allowing other researchers to build on this work and further improve the technology.
Abstract
Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha
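The abstract mentions a "condition-aware RoPE" for adapting the multi-modal input condition. As background, here is a minimal NumPy sketch of standard rotary position embeddings (RoPE), plus a hypothetical condition-aware use where each input stream (video tokens vs. reference-identity tokens) gets its own position offset. The offset scheme is an illustrative assumption, not the paper's actual formulation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding (RoPE) to feature vectors.

    x: (seq_len, dim) array with even dim; positions: (seq_len,) indices.
    Each feature pair (x[:, 2i], x[:, 2i+1]) is rotated by pos * theta_i,
    where theta_i follows the standard geometric frequency schedule.
    """
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-np.arange(half) / half)        # (half,) frequencies
    angles = positions[:, None] * theta[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # interleaved pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical "condition-aware" usage: give the reference-identity tokens
# a disjoint positional segment so attention can distinguish the streams.
video_tokens = np.random.randn(16, 8)
ref_tokens = np.random.randn(4, 8)
video_enc = rope(video_tokens, np.arange(16, dtype=float))
ref_enc = rope(ref_tokens, np.arange(4, dtype=float) + 1000.0)  # offset segment
```

Because RoPE applies pure rotations, it preserves the norm of each feature pair, and position 0 leaves the features unchanged; the offset only changes relative angles between the two token streams.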