Fashion-VDM: Video Diffusion Model for Virtual Try-On
Johanna Karras, Yingwei Li, Nan Liu, Luyang Zhu, Innfarn Yoo, Andreas Lugmayr, Chris Lee, Ira Kemelmacher-Shlizerman
2024-11-04

Summary
This paper introduces Fashion-VDM, a new video diffusion model for generating virtual try-on videos, letting users see how a clothing item would look on them in motion. The method focuses on preserving the person's identity and motion while realistically rendering the new garment.
What's the problem?
Image-based virtual try-on methods already produce good results, but video virtual try-on (VVT) still struggles with garment detail and temporal consistency, i.e., keeping the garment's appearance stable from frame to frame. The result is often unrealistic or low-quality videos when showing how clothes fit and move on a person.
What's the solution?
Fashion-VDM addresses these challenges with a diffusion-based video architecture. It uses split classifier-free guidance, which applies separate guidance strengths to the person and garment conditioning inputs for finer control over garment detail, and a progressive temporal training strategy that enables generating an entire video in a single pass. The model produces up to 64 frames at 512px resolution, keeping the clothing realistic and consistent with the person's movements. A sketch of the guidance idea follows.
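
To make the split guidance idea concrete, here is a minimal sketch in Python. It assumes a denoiser that can be queried with the person and garment conditions dropped independently; the function name, signature, and scale values are illustrative assumptions, not the paper's actual implementation.

    def split_cfg_denoise(model, x_t, t, person_cond, garment_cond,
                          s_person=2.0, s_garment=2.0):
        # Illustrative sketch of split classifier-free guidance: the denoiser
        # is queried with the conditions progressively dropped, and each
        # condition receives its own guidance scale. The exact composition
        # and scale values used by Fashion-VDM may differ.
        eps_uncond = model(x_t, t, person=None, garment=None)               # no conditioning
        eps_person = model(x_t, t, person=person_cond, garment=None)        # person only
        eps_full = model(x_t, t, person=person_cond, garment=garment_cond)  # person + garment

        # Separate weights let garment detail be strengthened without
        # over-amplifying the person/motion signal (and vice versa).
        return (eps_uncond
                + s_person * (eps_person - eps_uncond)
                + s_garment * (eps_full - eps_person))

The key design point is that a single guidance scale forces one trade-off for all conditions, whereas separate scales let the garment signal be pushed harder than the person signal (or the reverse) at sampling time.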
Why it matters?
This research is important because it sets a new standard for virtual try-on technology, making it easier for consumers to visualize clothing before purchasing. By improving the realism and detail in virtual try-on videos, Fashion-VDM can enhance online shopping experiences, helping customers make better decisions and reducing return rates due to poor fit or style.
Abstract
We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: https://johannakarras.github.io/Fashion-VDM.
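
The joint image-video training highlighted in the abstract can be pictured as mixing still images into video batches. The sketch below shows one simple way to do this, assuming a PyTorch data pipeline; the function name, sample counts, and tensor layout are assumptions rather than the paper's exact batching scheme.

    import torch

    def make_joint_batch(video_clips, images, num_image_samples=4):
        # video_clips: (B_v, T, C, H, W) real try-on video clips
        # images:      (B_i, C, H, W) still try-on images
        # Repeat each still image T times so it behaves like a static video
        # and can flow through the same temporal architecture.
        T = video_clips.shape[1]
        pseudo_videos = images[:num_image_samples].unsqueeze(1).repeat(1, T, 1, 1, 1)
        # Concatenate along the batch dimension; a real pipeline would also
        # carry the matching garment and person conditioning for each sample.
        return torch.cat([video_clips, pseudo_videos], dim=0)

    # Example: 2 video clips of 8 frames plus 4 still images at 64x64
    videos = torch.randn(2, 8, 3, 64, 64)
    stills = torch.randn(4, 3, 64, 64)
    batch = make_joint_batch(videos, stills)  # shape (6, 8, 3, 64, 64)

This kind of mixing is one common way to let abundant image data compensate for scarce video data when training a temporal model.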