UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, Yuliang Xiu
2025-10-10
Summary
This paper introduces UP2You, a new computer vision system that creates detailed 3D models of people wearing clothes directly from everyday photos you’d find online or take with your phone.
What's the problem?
Existing methods for creating 3D models of people from photos are very picky about the input images. They usually need full-body shots taken in good lighting, from specific viewpoints, and without anything blocking the view. That makes them hard to use on real-world photos, which are often cropped, partially occluded, and taken from arbitrary angles. Many methods are also slow, because they rely on lengthy online optimization for each person instead of a single feed-forward pass.
What's the solution?
UP2You solves this by first ‘cleaning up’ the messy input photos, turning them into a set of consistent, clear images of the person from fixed viewpoints, as if they were taken in a studio. It does this in a single quick pass using a new technique called a ‘data rectifier’. It then uses a module called PCFA (pose-correlated feature aggregation) to combine information from the different photos, weighting each one by how relevant it is to the target pose being reconstructed, so memory use stays roughly constant no matter how many photos you provide. UP2You also doesn’t need a pre-captured template of the person’s body; it predicts the body shape directly from the photos. The whole process takes about 90 seconds per person.
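To make the PCFA idea more concrete, here is a minimal PyTorch sketch of pose-correlated aggregation, written under our own assumptions rather than taken from the authors' code: tokens describing the target pose act as attention queries over tokens extracted from every reference photo, so the fused output keeps a fixed size no matter how many photos are supplied. The class name `PoseCorrelatedAggregator` and all dimensions are illustrative.

```python
# Hypothetical sketch of pose-correlated feature aggregation (not the official UP2You code).
# Target-pose tokens query features from N reference photos via cross-attention, so the
# fused representation has a fixed size regardless of how many photos are given.
import torch
import torch.nn as nn

class PoseCorrelatedAggregator(nn.Module):
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, pose_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # pose_tokens: (B, P, D) tokens describing the target pose/view
        # ref_tokens:  (B, N*R, D) tokens from N reference photos, R tokens each
        fused, _ = self.cross_attn(query=pose_tokens, key=ref_tokens, value=ref_tokens)
        return self.norm(fused + pose_tokens)  # (B, P, D), size independent of N

if __name__ == "__main__":
    agg = PoseCorrelatedAggregator()
    pose = torch.randn(1, 64, 256)        # 64 target-pose tokens
    refs = torch.randn(1, 5 * 196, 256)   # 5 reference photos, 196 tokens each
    print(agg(pose, refs).shape)          # torch.Size([1, 64, 256])
```

In this toy version only the attention's key/value set grows when more photos are added; the fused representation itself stays fixed-size, which is one simple way to keep the memory footprint roughly flat as observations accumulate.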
Why does it matter?
This is a big step forward because it means you can create realistic 3D models of people from almost any photo. This has lots of potential applications, like virtual try-on for clothes shopping, creating avatars for games or the metaverse, or even helping with personalized fashion recommendations. Because it’s fast and doesn’t require special equipment, it’s much more practical for everyday use than previous methods.
Abstract
We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA) that selectively fuses information from multiple reference images w.r.t. target poses, enabling better identity preservation and a nearly constant memory footprint as more observations are added. We also introduce a perceiver-based multi-reference shape predictor, removing the need for pre-captured body templates. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer-15%, P2S-18% on PuzzleIOI) and texture fidelity (PSNR-21%, LPIPS-46% on 4D-Dress). UP2You is efficient (1.5 minutes per person) and versatile (supporting arbitrary pose control and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task. Project Page: https://zcai0612.github.io/UP2You
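The "perceiver-based multi-reference shape predictor" mentioned in the abstract can likewise be pictured with a short, hypothetical sketch (again, not the released implementation): a small set of learned latent queries cross-attends to tokens from all reference photos and regresses a low-dimensional body-shape vector, which is what allows the method to skip a pre-captured body template. The name `MultiRefShapePredictor` and the SMPL-X-style 10-dimensional shape output are assumptions made for illustration.

```python
# Hypothetical perceiver-style multi-reference shape predictor (illustrative only,
# not the released UP2You implementation). A fixed set of learned latent queries
# cross-attends to tokens from all reference photos and regresses body-shape
# parameters, so no pre-captured body template is required.
import torch
import torch.nn as nn

class MultiRefShapePredictor(nn.Module):
    def __init__(self, embed_dim: int = 256, num_latents: int = 32,
                 num_heads: int = 8, shape_dim: int = 10):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, embed_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, shape_dim),  # e.g., SMPL-X-style shape coefficients (assumed)
        )

    def forward(self, ref_tokens: torch.Tensor) -> torch.Tensor:
        # ref_tokens: (B, N*R, D) image tokens gathered from all reference photos
        B = ref_tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(B, -1, -1)   # (B, L, D)
        latents, _ = self.cross_attn(queries, ref_tokens, ref_tokens)
        return self.head(latents.mean(dim=1))                   # (B, shape_dim)

if __name__ == "__main__":
    predictor = MultiRefShapePredictor()
    refs = torch.randn(2, 3 * 196, 256)   # 3 reference photos, 196 tokens each
    print(predictor(refs).shape)          # torch.Size([2, 10])
```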