CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation
Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, Xiaodan Liang
2025-01-27
Summary
This paper introduces CatV2TON, a virtual try-on method that uses a single diffusion transformer model to create realistic clothing visualizations in both images and videos. It's designed to produce higher-quality results than existing methods while using fewer resources, especially for longer videos.
What's the problem?
Current virtual try-on technologies often struggle to produce high-quality results for both images and videos, particularly when dealing with longer video sequences. This limitation makes it difficult for online retailers to offer a consistent and realistic virtual try-on experience across different formats.
What's the solution?
The researchers developed CatV2TON, which stacks the garment and person inputs along the time dimension, a technique called temporal concatenation, so that a single diffusion transformer can handle both image and video try-on. They trained this model on a mix of image and video data. To handle longer videos efficiently, they generate the video in overlapping chunks, using the last few frames of each chunk to guide the next one, and apply Adaptive Clip Normalization (AdaCN) to keep the chunks visually consistent where they meet (a sketch of the concatenation idea follows below). They also built a refined video try-on dataset called ViViD-S by filtering out back-facing frames and smoothing the garment masks over time for better temporal consistency.
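To make the temporal-concatenation idea concrete, here is a minimal PyTorch sketch. The tensor shapes, variable names, and the choice to treat the garment as an extra leading "frame" are assumptions for illustration, not the authors' implementation:

```python
import torch

def concat_inputs(garment: torch.Tensor, person: torch.Tensor) -> torch.Tensor:
    """Concatenate along time: (B, Tg, C, H, W) + (B, Tp, C, H, W) -> (B, Tg+Tp, C, H, W)."""
    return torch.cat([garment, person], dim=1)

batch, chans, h, w = 1, 4, 64, 48
garment_latent = torch.randn(batch, 1, chans, h, w)   # garment treated as one leading "frame"
person_latents = torch.randn(batch, 16, chans, h, w)  # a 16-frame person clip

tokens = concat_inputs(garment_latent, person_latents)
print(tokens.shape)  # torch.Size([1, 17, 4, 64, 48])
```

Because the garment and person occupy one shared temporal sequence, the transformer's attention can relate them directly, with no separate garment-encoding branch needed.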
Why it matters?
This research matters because it could significantly improve online shopping experiences. Better virtual try-on technology means shoppers can more accurately see how clothes will look on them before buying, potentially reducing returns and increasing customer satisfaction. For businesses, this could lead to increased sales and lower costs. The ability to handle both images and videos with a single, efficient model also makes it more practical for companies to implement this technology across their online platforms.
Abstract
Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization in images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, constructed by filtering out back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
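For the long-video strategy, the hedged sketch below shows one way overlapping clips and an AdaCN-style step could fit together. Here `denoise_clip` is a stand-in for the diffusion transformer's sampling loop, and matching the mean and standard deviation of the overlapping frames is one plausible reading of Adaptive Clip Normalization; all names, shapes, and the clip/overlap sizes are assumptions, not the paper's code:

```python
import torch

def denoise_clip(cond: torch.Tensor, guide: torch.Tensor | None = None) -> torch.Tensor:
    # Placeholder for the diffusion transformer's sampling loop; a real
    # implementation would condition on `guide` (frames from the previous clip).
    return torch.randn_like(cond)

def adacn(new_clip: torch.Tensor, prev_tail: torch.Tensor, overlap: int) -> torch.Tensor:
    """Shift/scale new_clip so its first `overlap` frames match prev_tail's statistics."""
    head = new_clip[:, :overlap]
    mu_new, sd_new = head.mean(), head.std()
    mu_prev, sd_prev = prev_tail.mean(), prev_tail.std()
    return (new_clip - mu_new) / (sd_new + 1e-6) * sd_prev + mu_prev

def generate_long_video(cond_frames: torch.Tensor, clip_len: int = 16, overlap: int = 4) -> torch.Tensor:
    """cond_frames: (B, T, C, H, W) conditioning; generates T frames clip by clip."""
    out = denoise_clip(cond_frames[:, :clip_len])        # first clip, no guidance
    t = clip_len
    while t < cond_frames.size(1):
        guide = out[:, -overlap:]                        # tail frames of the previous clip
        window = cond_frames[:, t - overlap : t - overlap + clip_len]
        nxt = adacn(denoise_clip(window, guide), guide, overlap)
        out = torch.cat([out, nxt[:, overlap:]], dim=1)  # keep only the new frames
        t += clip_len - overlap
    return out

video = generate_long_video(torch.randn(1, 40, 4, 64, 48))
print(video.shape)  # torch.Size([1, 40, 4, 64, 48])
```

The key resource saving is that each denoising pass only ever sees one short clip, so memory stays constant regardless of video length, while the reused tail frames and the statistic alignment keep the stitched clips from drifting apart.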