EVTAR: End-to-End Try on with Additional Unpaired Visual Reference
Liuzhuozheng Li, Yue Gong, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Dengyang Jiang, Zanyi Wang, Dawei Leng, Yuhui Yin
2025-11-07
Summary
This paper introduces EVTAR, a new computer vision model designed to realistically 'try on' clothes on a person in an image, directly showing what an outfit would look like on them.
What's the problem?
Existing virtual try-on systems are complicated because they need a lot of extra information, such as precise body shapes, poses, or detailed outlines of the person in the image. Gathering all of this information is time-consuming and makes these systems hard to use in everyday situations, like shopping online.
What's the solution?
EVTAR simplifies this process by needing only a picture of the person and the clothing item. It is trained in two stages, but when you actually want to try something on, it works directly from those two images, with no masks, pose maps, or segmentation required. A key part of EVTAR is that it also looks at pictures of *other* people wearing the same clothes. These unpaired references help it keep the garment's texture and fine details realistic, much like how you might look at a model wearing an outfit to judge how it fits and looks.
Why it matters?
EVTAR is important because it makes virtual try-on technology much more practical and user-friendly. By removing the need for complex inputs, it opens the door to easier integration into applications such as online shopping, letting people see how clothes will look on them with little effort while producing more realistic results.
Abstract
We propose EVTAR, an End-to-End Virtual Try-on model with Additional Reference that directly fits the target garment onto the person image while incorporating reference images to enhance try-on accuracy. Most existing virtual try-on approaches rely on complex inputs such as agnostic person images, human pose, densepose, or body keypoints, making them labor-intensive and impractical for real-world applications. In contrast, EVTAR adopts a two-stage training strategy, enabling simple inference with only the source image and the target garment as inputs. Our model generates try-on results without masks, densepose, or segmentation maps. Moreover, EVTAR leverages additional reference images of different individuals wearing the same clothes to better preserve garment texture and fine-grained details. This mechanism is analogous to how humans consider reference models when choosing outfits, thereby simulating a more realistic and high-quality dressing effect. We enrich the training data with supplementary references and unpaired person images to support these capabilities. We evaluate EVTAR on two widely used benchmarks and diverse tasks, and the results consistently validate the effectiveness of our approach.
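To make the mask-free inference setup concrete, below is a minimal interface sketch in Python. It is not the authors' released code: the class name, method signature, and tensor shapes are all hypothetical, and the generator body is a placeholder standing in for the actual model.

```python
# Minimal interface sketch (illustrative only, not the authors' implementation).
import torch

class EVTARModel(torch.nn.Module):
    """Placeholder for an end-to-end, mask-free virtual try-on generator."""

    def generate(self, person, garment, reference=None, steps=30):
        # No agnostic mask, densepose, or keypoints are passed in:
        # the inputs are just images, matching the paper's inference setup.
        # A real model would run its generation process here; this
        # placeholder simply echoes the person image.
        return person

model = EVTARModel()

person = torch.rand(1, 3, 512, 384)     # source image of the person
garment = torch.rand(1, 3, 512, 384)    # target clothing item
reference = torch.rand(1, 3, 512, 384)  # optional: another person wearing the same garment

# The optional unpaired reference is what helps preserve garment texture
# and fine-grained detail in the paper's formulation.
result = model.generate(person, garment, reference=reference)
print(result.shape)  # torch.Size([1, 3, 512, 384])
```

The point the sketch highlights is the input contract: only a person image and a garment image are required, and an unpaired reference of someone else wearing the same garment can optionally be supplied to improve texture fidelity.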