Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
2025-08-29
Summary
This paper introduces Dress&Dance, a new system that creates realistic videos of people virtually trying on clothes. From a single photo of a person and a reference video showing the desired motion, it generates a video of that person wearing the chosen garments while moving along with the reference video.
What's the problem?
Existing virtual try-on technologies often produce low-quality results: the clothing looks unnatural or fails to move realistically with the body. Generating these videos is difficult because a model must capture both how garments look and how they behave when a person moves, and good training video for this task is hard to come by.
What's the solution?
The researchers built Dress&Dance around CondNet, a conditioning network that combines information from text (describing the garments), images (of the garments), and video (of the reference motion). CondNet uses attention to focus on the most relevant parts of each input, making the virtual try-on more accurate and realistic. Because try-on video data is scarce, they trained CondNet on a mix of limited video data and a much larger, more readily available collection of still images, improving its performance in progressive stages.
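To make the attention-based conditioning idea concrete, here is a minimal, illustrative sketch in NumPy. This is not the paper's actual architecture: the token counts, dimensions, and the single-head cross-attention layout are all assumptions made for illustration. The key idea it shows is that text, image, and video conditions can be flattened into one shared token sequence that the (noisy) video latents attend over, so all modalities are unified by the same attention mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, cond_tokens):
    """Latent video tokens (queries) attend over the unified condition tokens."""
    d = queries.shape[-1]
    scores = queries @ cond_tokens.T / np.sqrt(d)   # (Nq, Nc) similarity
    weights = softmax(scores, axis=-1)              # each query sums to 1
    return weights @ cond_tokens                    # (Nq, d) conditioned output

rng = np.random.default_rng(0)
d = 64                                              # hypothetical token width
text_tokens  = rng.standard_normal((8,  d))         # garment description tokens
image_tokens = rng.standard_normal((16, d))         # garment image patches
video_tokens = rng.standard_normal((32, d))         # reference-motion frames

# Unify the three modalities into one sequence the latents can attend to.
cond = np.concatenate([text_tokens, image_tokens, video_tokens], axis=0)

latents = rng.standard_normal((24, d))              # noisy video latents (queries)
out = cross_attend(latents, cond)
print(out.shape)  # (24, 64)
```

In a real diffusion backbone this cross-attention would be learned (with projection matrices and multiple heads) and repeated across layers; the sketch only shows why a single token sequence lets one mechanism handle text, image, and video conditions at once.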
Why does it matter?
Dress&Dance is a significant step forward in virtual try-on technology because it produces higher-quality videos than existing open-source and commercial options. This could make online clothes shopping much more convenient and help people visualize how clothes will look on them before they buy, potentially reducing returns and improving customer satisfaction.
Abstract
We present Dress&Dance, a video diffusion framework that generates high-quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open-source and commercial solutions and enables a high-quality and flexible try-on experience.
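The abstract states that CondNet is trained on heterogeneous data, mixing limited video with a larger image dataset in a multistage progressive manner, but not how that mix is realized. A common way to combine such sources is to treat each still image as a one-frame clip and vary the sampling ratio across stages. The sketch below assumes that approach; the `as_clip` normalization, the `video_ratio` schedule, and all field names are hypothetical, not details from the paper.

```python
import random

def as_clip(sample):
    """Normalize a sample so images and videos share one clip format.
    (Hypothetical detail: a still image becomes a 1-frame clip.)"""
    if sample["type"] == "image":
        return {"frames": [sample["pixels"]], "caption": sample["caption"]}
    return {"frames": sample["frames"], "caption": sample["caption"]}

def mixed_batches(images, videos, video_ratio, batch_size, seed=0):
    """Yield training batches drawn from a large image set and a small video set.
    A progressive schedule could raise `video_ratio` in later stages."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            pool = videos if rng.random() < video_ratio else images
            batch.append(as_clip(rng.choice(pool)))
        yield batch

# Toy data: many captioned images, few short videos.
images = [{"type": "image", "pixels": f"img{i}", "caption": "red dress"}
          for i in range(100)]
videos = [{"type": "video", "frames": [f"frame{t}" for t in range(8)],
           "caption": "dance clip"} for _ in range(5)]

batch = next(mixed_batches(images, videos, video_ratio=0.25, batch_size=4))
print(len(batch))  # 4
```

The point of the unified clip format is that the same diffusion training loop can consume both sources without branching, which is what makes the larger image dataset usable alongside the scarce video data.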