The core pipeline of 3DV-TON is adaptive and multi-stage. It begins by selecting a keyframe from the input video for initial 2D image try-on. Next, it reconstructs and animates a textured 3D mesh that is synchronized with the original video's body poses. This mesh acts as a dynamic guide, showing the diffusion model how the garment should move and deform across frames. To further enhance visual quality, 3DV-TON incorporates a robust rectangular masking strategy that prevents artifact propagation and mitigates the risk of leaking original clothing information during rapid movements. As a result, even in challenging scenarios, such as fast motion or complex interactions between the body and garments, the generated results remain coherent and largely free of visual artifacts.
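The rectangular masking step lends itself to a short illustration. The sketch below assumes a per-frame garment region is available as a boolean segmentation array; the function name `rectangular_mask`, the `margin` parameter, and the exact dilation rule are illustrative assumptions, not the paper's published implementation. The idea it demonstrates is that inpainting the whole dilated rectangle, rather than only the tight garment silhouette, hides original clothing pixels even when fast motion makes the segmentation lag behind the body.

```python
import numpy as np

def rectangular_mask(garment_seg: np.ndarray, margin: float = 0.1) -> np.ndarray:
    """Replace a tight garment segmentation with a dilated rectangular mask.

    garment_seg: boolean (H, W) array marking the garment pixels in one frame.
    margin: relative expansion of the bounding box, absorbing segmentation
            errors during rapid movement so no clothing texture leaks through.
    """
    h, w = garment_seg.shape
    ys, xs = np.nonzero(garment_seg)
    if ys.size == 0:
        # No garment detected in this frame: nothing to mask.
        return np.zeros_like(garment_seg, dtype=bool)

    # Tight bounding box of the segmented garment region.
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()

    # Expand the box by a relative margin, clamped to the frame borders.
    dy = int(margin * (y1 - y0 + 1))
    dx = int(margin * (x1 - x0 + 1))
    y0, y1 = max(0, y0 - dy), min(h - 1, y1 + dy)
    x0, x1 = max(0, x0 - dx), min(w - 1, x1 + dx)

    # Fill the whole rectangle: every pixel inside is regenerated by the
    # diffusion model, so residual garment pixels cannot propagate.
    mask = np.zeros_like(garment_seg, dtype=bool)
    mask[y0 : y1 + 1, x0 : x1 + 1] = True
    return mask
```

In a pipeline like the one described above, such a mask would be computed per frame and handed to the diffusion model as the inpainting region, alongside the animated 3D mesh renderings that guide garment motion.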
To support rigorous evaluation and foster research in the field, 3DV-TON introduces the HR-VVT benchmark dataset, which contains 130 high-resolution videos covering a wide variety of clothing types and scenarios. Quantitative and qualitative assessments demonstrate that 3DV-TON consistently outperforms existing video try-on methods, particularly in temporal consistency and garment detail preservation. The framework's combination of diffusion modeling, 3D mesh guidance, and artifact mitigation strategies positions it as a leading solution for realistic, high-quality video try-on, suitable for applications in fashion, e-commerce, and digital content creation.