Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo
2026-04-08
Summary
This paper introduces Vanast, a system that creates realistic videos of people wearing different clothes and moving around, starting from just three inputs: a single picture of a person, pictures of the clothes, and a video showing how the person should move.
What's the problem?
Existing methods treat digitally 'trying on' clothes and animating people as two separate steps. Because the two stages don't share information, the person's appearance can drift over the course of the video, the clothes can fit incorrectly or look distorted, and the front and back of a garment may not match up realistically. In short, things don't look quite right because the two parts aren't working together.
What's the solution?
Vanast solves this by performing both the virtual try-on *and* the animation in a single process. To make this possible, the researchers built a huge dataset of 'triplets' that pair a picture of a person, pictures of garments, and a motion video with the matching output video. They also developed a special architecture for the system, called a Dual Module, which stabilizes training and helps the model create more accurate videos: it keeps the original person's appearance consistent, makes sure the clothes fit well, and follows the movements in the guidance video.
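To make the idea concrete, here is a minimal PyTorch-style sketch of one plausible reading of such a Dual Module: a frozen pretrained transformer block paired with a trainable copy that also sees the conditioning tokens (identity, garment, pose) and feeds its output back through a zero-initialized projection. This is an illustrative assumption, not the authors' released code; all names (`DualModuleBlock`, `cond_tokens`, etc.) are hypothetical.

```python
# Illustrative sketch only -- the paper does not publish this code.
# One plausible reading of a "Dual Module": a frozen pretrained block
# plus a trainable twin that also sees conditioning tokens, injected
# through a zero-initialized projection so training starts as a no-op
# and cannot degrade the pretrained generative quality.
import copy
import torch
import torch.nn as nn

class DualModuleBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, hidden_dim: int):
        super().__init__()
        self.trainable = copy.deepcopy(pretrained_block)  # trainable twin
        self.frozen = pretrained_block                    # kept frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)  # zero-init: exact identity at step 0
        nn.init.zeros_(self.proj.bias)

    def forward(self, video_tokens, cond_tokens):
        base = self.frozen(video_tokens)  # pretrained path, untouched
        # Dual path: the twin attends over video + conditioning tokens
        # (identity, garment, and pose embeddings concatenated upstream).
        fused = self.trainable(torch.cat([video_tokens, cond_tokens], dim=1))
        delta = self.proj(fused[:, : video_tokens.shape[1]])
        return base + delta               # residual injection into the backbone

# Toy check with a stand-in transformer layer:
blk = DualModuleBlock(nn.TransformerEncoderLayer(512, 8, batch_first=True), 512)
out = blk(torch.randn(1, 64, 512), torch.randn(1, 32, 512))  # -> (1, 64, 512)
```

Zero-initialized residual branches are a common way (e.g., in ControlNet-style adapters) to add conditioning without destabilizing a pretrained model, which matches the paper's stated goals of stable training and preserved generative quality.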
Why it matters?
This work is important because it allows for the creation of much more realistic and believable animated videos of people wearing different clothes. It has potential applications in virtual fashion shows, personalized avatars, and film production, where it could reduce the cost and effort of creating realistic scenes with digital characters.
Abstract
We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper- and lower-garment triplets to overcome the limitation of pairing only a single garment with a pose video, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
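As a rough illustration of what "triplet supervision" means here, the sketch below shows the shape of one training sample as the abstract describes it, plus a single-stage denoising step in which the model conditions on all inputs jointly. The field names, the flow-matching interpolation, and the model signature are assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical sketch (not from the paper): one triplet-supervision sample
# as the abstract describes it -- a single human image, garment images, and
# a pose guidance video as conditioning, with the garment-transferred
# animation video as the target.
from dataclasses import dataclass
import torch

@dataclass
class TryOnTriplet:
    person_image: torch.Tensor    # (3, H, W) source identity image
    garment_images: torch.Tensor  # (K, 3, H, W) e.g. upper + lower garments
    pose_video: torch.Tensor      # (T, 3, H, W) pose guidance frames
    target_video: torch.Tensor    # (T, 3, H, W) person in the garments, moving

def training_step(model, sample: TryOnTriplet):
    """One denoising step: the model sees all conditions jointly
    (single-stage), rather than try-on first and animation second."""
    x0 = sample.target_video.unsqueeze(0)       # (1, T, 3, H, W)
    t = torch.rand(1)                           # diffusion time in [0, 1]
    noise = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * noise               # simple flow-matching interp
    pred = model(xt, t,
                 person=sample.person_image.unsqueeze(0),
                 garments=sample.garment_images.unsqueeze(0),
                 pose=sample.pose_video.unsqueeze(0))
    return torch.nn.functional.mse_loss(pred, noise - x0)  # velocity target
```

The point of the single `training_step` is structural: try-on and animation are never split into two stages, which is exactly the property the abstract attributes to the unified design.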