Durian: Dual Reference-guided Portrait Animation with Attribute Transfer
Hyunsoo Cha, Byungjun Kim, Hanbyul Joo
2025-09-05
Summary
This paper introduces Durian, a new AI system that can generate realistic videos of a person's face while transferring facial characteristics (such as glasses or a hairstyle) from another image, all without needing to be specifically trained on those characteristics beforehand.
What's the problem?
Creating videos where someone's facial features change (like adding glasses, changing hair color, or making them smile) is difficult because the program must understand both how those features look and how to apply them to a moving face consistently across every frame. Existing methods often struggle to make such changes look natural and keep them stable throughout the video, especially when several changes are combined at once.
What's the solution?
The researchers tackled this using a diffusion model, a type of AI that is good at generating realistic images. They extended it with 'dual reference networks', which let the model look at both the person's portrait and the image containing the desired attribute at the same time, so the attribute is transferred more accurately. They also devised a clever self-reconstruction training scheme: two frames are sampled from the same portrait video, one is treated as the attribute reference and the other as the target portrait, and the model learns to reconstruct the remaining frames from these inputs and their masks. To handle attributes of different sizes (a hat covers far more of the frame than earrings), they expanded the masks that define where changes should be applied. Finally, they made the model more robust by augmenting the training images with small variations in position and appearance.
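The self-reconstruction setup described above can be sketched as a simple frame-sampling routine. This is a hypothetical illustration, not code from the paper; the function name and parameters are ours:

```python
import random

def sample_self_reconstruction_split(num_frames, seed=None):
    """Split one portrait video's frames for self-reconstruction training.

    Two distinct frames are drawn from the SAME video: one plays the
    role of the attribute reference, the other the target portrait.
    All remaining frames become reconstruction targets, conditioned on
    those two references (and, in the paper, their corresponding masks).
    """
    rng = random.Random(seed)
    attr_idx, portrait_idx = rng.sample(range(num_frames), 2)
    recon_targets = [i for i in range(num_frames)
                     if i not in (attr_idx, portrait_idx)]
    return attr_idx, portrait_idx, recon_targets
```

Because both references come from the same video, no explicit (portrait, attribute, result) triplets are ever needed, which is why the abstract notes the model is trained "without explicit triplet supervision".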
Why it matters?
Durian is important because it's the first system that can convincingly animate a face with new attributes from a single reference image without needing specific training for each attribute. This means you could, for example, make someone in a video appear to wear any pair of glasses just by showing the program a picture of those glasses, and it would apply them realistically to the video. It also allows for combining multiple attribute changes in one go, opening up possibilities for creating more complex and personalized video effects.
Abstract
We present Durian, the first method for generating portrait animation videos with facial attribute transfer from a given reference image to a target portrait in a zero-shot manner. To enable high-fidelity and spatially consistent attribute transfer across frames, we introduce dual reference networks that inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. We train the model using a self-reconstruction formulation, where two frames are sampled from the same portrait video: one is treated as the attribute reference and the other as the target portrait, and the remaining frames are reconstructed conditioned on these inputs and their corresponding masks. To support the transfer of attributes with varying spatial extent, we propose a mask expansion strategy using keypoint-conditioned image generation for training. In addition, we further augment the attribute and portrait images with spatial and appearance-level transformations to improve robustness to positional misalignment between them. These strategies allow the model to effectively generalize across diverse attributes and in-the-wild reference combinations, despite being trained without explicit triplet supervision. Durian achieves state-of-the-art performance on portrait animation with attribute transfer, and notably, its dual reference design enables multi-attribute composition in a single generation pass without additional training.
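To build intuition for how dual reference networks might inject spatial features into the denoising process, here is a toy sketch in which the tokens being denoised attend over a context extended with feature vectors from both reference branches. The function names and the plain scaled dot-product attention are our illustrative assumptions; the paper's actual architecture injects features inside a diffusion backbone and is not reproduced here:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Plain scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim)]

def denoise_tokens_with_dual_refs(x_tokens, portrait_feats, attr_feats):
    """Each token being denoised attends over itself plus spatial
    features from BOTH reference branches, so portrait identity and
    attribute appearance are available in a single pass."""
    context = x_tokens + portrait_feats + attr_feats
    return [attend(q, context, context) for q in x_tokens]
```

Concatenating both reference feature sets into one attention context also hints at why multi-attribute composition falls out naturally: additional attribute features can be appended to the context without retraining. Again, this is an intuition-level sketch, not the paper's exact mechanism.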