SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction
Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, Yuan Liu
2025-10-28
Summary
This paper introduces a new method, called SyncHuman, for creating realistic 3D models of people from just a single 2D image. It's aimed at improving how characters are made for things like movies and video games.
What's the problem?
Creating a full 3D model of a person from just one picture is really hard: the computer has to 'guess' what the hidden parts of the body look like. Previous approaches relied on a basic 3D human shape model called SMPL as a prior, but SMPL estimates from a single image aren't always accurate, so these methods struggle with difficult poses and with small details like wrinkles in clothing.
What's the solution?
SyncHuman solves this by combining two different types of AI models. One model is good at creating detailed 2D images of the person from several viewpoints, but isn't great at keeping the underlying 3D shape consistent across those views. The other model creates a coarser but structurally sound 3D shape directly. The researchers jointly fine-tuned these two models, using a technique called 'pixel-aligned 2D-3D synchronization attention' to make sure the 2D details and the 3D shape line up correctly. They then added a feature-injection step that lifts the fine details from the 2D images onto the aligned 3D shape, making the final model look much more realistic.
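To make the 'synchronization attention' idea concrete, here is a minimal sketch, assuming a plain bidirectional cross-attention between flattened multiview image tokens and 3D shape tokens. The module name, tensor shapes, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- the class name, shapes, and the use of plain
# bidirectional cross-attention are assumptions, not the paper's code.
import torch
import torch.nn as nn

class SyncAttention(nn.Module):
    """Cross-attend 3D shape tokens to multiview 2D image tokens (and back),
    so the two generative branches stay geometrically aligned."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.to_3d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_2d = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens_2d: torch.Tensor, tokens_3d: torch.Tensor):
        # tokens_2d: (B, V*H*W, C) flattened features of V generated views
        # tokens_3d: (B, N, C)     features of N 3D shape tokens
        # 3D tokens query the 2D views, then 2D tokens query the 3D shape;
        # residual connections keep each branch's own features.
        tokens_3d = tokens_3d + self.to_3d(tokens_3d, tokens_2d, tokens_2d)[0]
        tokens_2d = tokens_2d + self.to_2d(tokens_2d, tokens_3d, tokens_3d)[0]
        return tokens_2d, tokens_3d

# Usage sketch:
# sync = SyncAttention(dim=512)
# f2d = torch.randn(1, 4 * 32 * 32, 512)  # 4 views of 32x32 feature maps
# f3d = torch.randn(1, 2048, 512)         # 2048 shape tokens
# f2d, f3d = sync(f2d, f3d)
```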
Why it matters?
This research is important because it significantly improves the quality of 3D human models created from single images. This could make it much easier and cheaper to create characters for movies, video games, and other applications, and it shows a promising new direction for building more advanced 3D generation models.
Abstract
Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and struggle to handle challenging human poses and to reconstruct fine details. In this paper, we propose SyncHuman, a novel framework that, for the first time, combines a 2D multiview generative model with a 3D native generative model, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging poses. The multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas the 3D native generative model produces coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with a proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from the 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photorealistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.
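As a companion illustration of the feature injection mechanism mentioned above, the sketch below projects points on the coarse 3D shape into each generated view and samples per-view image features there. The orthographic camera model and all function and variable names are hypothetical assumptions for illustration, not taken from the paper.

```python
# Illustrative sketch of pixel-aligned 2D -> 3D feature injection.
# The orthographic camera model and all names below are assumptions.
import torch
import torch.nn.functional as F

def inject_multiview_features(points, feats_2d, view_mats):
    """Lift per-view 2D features onto 3D surface points.

    points:    (B, N, 3)       surface points of the coarse 3D shape,
                               assumed already normalized to [-1, 1]
    feats_2d:  (B, V, C, H, W) feature maps of the V generated views
    view_mats: (B, V, 3, 3)    world -> camera rotations (orthographic)
    returns:   (B, N, V*C)     concatenated per-view features per point
    """
    B, V, C, H, W = feats_2d.shape
    gathered = []
    for v in range(V):
        # Rotate points into the v-th camera frame; under an orthographic
        # camera the (x, y) coordinates are the sampling grid directly.
        cam = torch.einsum('bij,bnj->bni', view_mats[:, v], points)
        uv = cam[..., :2].unsqueeze(1)             # (B, 1, N, 2) in [-1, 1]
        f = F.grid_sample(feats_2d[:, v], uv,      # (B, C, 1, N)
                          align_corners=False)
        gathered.append(f.squeeze(2).transpose(1, 2))  # (B, N, C)
    return torch.cat(gathered, dim=-1)
```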