DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies
Renke Wang, Zhenyu Zhang, Ying Tai, Jian Yang
2026-01-06
Summary
This paper introduces a new method, called DiffProxy, for creating 3D models of people from multiple 2D images. It focuses on improving the accuracy of these models, especially when dealing with real-world photos that aren't perfect.
What's the problem?
Creating accurate 3D human models from images is tough because the data used to train the computer programs often has flaws. If you use real-world images, the 'ground truth' (the correct 3D model) might be inaccurate, which can teach the program to make mistakes. If you use perfectly labeled synthetic data, it doesn't always look realistic enough to work well on real photos – there's a disconnect between the two. Essentially, training data is either flawed or doesn't translate well to the real world.
What's the solution?
DiffProxy tackles this by generating 'proxy' 3D models that are consistent from all viewpoints. It uses a technique called diffusion modeling, which is a type of AI that's good at creating realistic images and shapes. The system creates these proxies using synthetic data, ensuring accuracy, and then uses the diffusion model to make them look more realistic and adaptable to real-world photos. It also includes a way to refine details like hands and a method to handle difficult cases during the 3D model creation process.
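The paper does not release pseudocode in this summary, but the idea behind "uncertainty-aware test-time scaling" can be illustrated in miniature: draw several generated proxies per view, treat the spread of the samples as an uncertainty estimate, and down-weight unreliable views during fitting. The sketch below is a toy stand-in with NumPy, not the authors' implementation; the noise levels, sample counts, and parameter vector are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: the "true" mesh parameters we want to recover
# (e.g., a few pose/shape coefficients) as a small vector.
true_params = np.array([0.5, -1.2, 0.8])

# Simulate K diffusion samples ("proxies") per view; one view is
# heavily corrupted, mimicking occlusion or a partial view.
def sample_proxies(noise_scale, k=8):
    return true_params + rng.normal(0.0, noise_scale, size=(k, 3))

view_noise = [0.05, 0.05, 0.6]   # third view is unreliable
proxies = [sample_proxies(s) for s in view_noise]

# Uncertainty-aware aggregation: use the within-view spread of the
# generated samples as an uncertainty estimate, and weight each
# view's mean proxy by its inverse variance.
means = np.array([p.mean(axis=0) for p in proxies])
variances = np.array([p.var(axis=0).mean() + 1e-8 for p in proxies])
weights = 1.0 / variances
weights /= weights.sum()

estimate = (weights[:, None] * means).sum(axis=0)   # robust estimate
naive = means.mean(axis=0)                          # unweighted baseline

print("weighted error:", np.linalg.norm(estimate - true_params))
print("naive error:   ", np.linalg.norm(naive - true_params))
```

In this toy setup the corrupted view receives a near-zero weight, so the weighted estimate is dominated by the clean views; the same inverse-variance intuition is one plausible way a fitting loop could exploit multiple diffusion samples at test time.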
Why does it matter?
This research is important because it allows for more accurate 3D human modeling using only synthetic training data. This is a big deal because getting accurate labels for real-world images is expensive and time-consuming. By achieving state-of-the-art results on several real-world datasets, even with challenging conditions like partial views or objects blocking the person, DiffProxy shows a significant step forward in making 3D human reconstruction more reliable and accessible.
Abstract
Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models' training, while synthetic data with precise supervision suffers from a domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is leveraging diffusion-based generative priors to bridge synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and the generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization, particularly in challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html