Diffuman4D tackles novel view synthesis of humans from sparse-view videos with a spatio-temporal diffusion model. The model uses skeleton-Plücker conditioning: encoded skeleton latents and Plücker ray coordinates are concatenated with the image latents at input views and with the noise latents at target views. The samples across all views and timestamps form a sample grid, which the model denoises with a sliding iterative mechanism and then decodes into the target videos.
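To make the Plücker part of the conditioning concrete, the sketch below computes a per-pixel Plücker ray map for a pinhole camera: each pixel gets its normalized world-space ray direction d and the ray moment m = o × d (with o the camera center), yielding a 6-channel embedding that can be concatenated with latents. This is a minimal illustration of the standard Plücker parameterization, not Diffuman4D's actual code; the function name and conventions (world-to-camera R, t) are assumptions.

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plucker ray coordinates of shape (H, W, 6).

    K: (3, 3) camera intrinsics.
    R, t: world-to-camera rotation (3, 3) and translation (3,).
    (Hypothetical helper for illustration, not the paper's API.)
    """
    # Camera center in world coordinates: o = -R^T t.
    o = -R.T @ t
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)           # (H, W, 3)
    # Back-project to world-space ray directions: d = R^T K^{-1} [u, v, 1].
    d = pix @ np.linalg.inv(K).T @ R                           # (H, W, 3)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Moment m = o x d locates the ray independently of the point on it.
    m = np.cross(np.broadcast_to(o, d.shape), d)
    # 6-channel map: (direction, moment) per pixel.
    return np.concatenate([d, m], axis=-1)                     # (H, W, 6)
```

In practice such a map would be resized to the latent resolution and concatenated channel-wise with the skeleton and image (or noise) latents at each view.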
Diffuman4D addresses the sparse-view challenge by generating 4D-consistent multi-view videos conditioned on the input videos. The generated videos enable high-quality 4D Gaussian Splatting (4DGS) reconstructions, allowing free-viewpoint rendering of humans in motion. The method produces high-fidelity results and could serve applications such as film, video games, and virtual reality, marking a significant step forward in 4D human view synthesis.