Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou

2025-07-18

Summary

This paper introduces Diffuman4D, a new method that creates smooth and realistic 4D videos of humans in motion, even when the original footage comes from only a few camera views.

What's the problem?

Existing methods struggle to stay consistent both over time and across viewpoints when generating novel views from sparse video data, which leads to flickering or unnatural-looking results.

What's the solution?

The authors introduce a sliding iterative denoising process in their 4D diffusion model: it gradually denoises the data while enforcing consistency across both space (camera views) and time (video frames). They also use 3D human skeleton data as a guide to improve the quality and coherence of the generated videos.
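To give a rough intuition for the sliding idea, here is a minimal toy sketch (not the paper's actual implementation). It assumes one noisy latent per camera view and a hypothetical `denoise_step` function; a window slides across the views, and because neighboring windows overlap, each joint denoising pass propagates consistency to its neighbors.

```python
import numpy as np

def sliding_iterative_denoise(latents, denoise_step, num_steps=4, window=4, stride=2):
    """Toy sketch of sliding-window iterative denoising (illustrative only).

    latents: array of shape (num_views, feature_dim) -- one noisy latent per view.
    denoise_step: hypothetical function that jointly denoises a window of latents.
    Overlapping windows share latents, so consistency spreads across all views.
    """
    n = latents.shape[0]
    for step in range(num_steps):                      # outer denoising iterations
        for start in range(0, n - window + 1, stride):  # slide the window over views
            sl = slice(start, start + window)
            # jointly denoise the latents inside the current window; the overlap
            # with the previous window keeps adjacent views consistent
            latents[sl] = denoise_step(latents[sl], step)
    return latents

# toy "denoiser": pull each latent in the window toward the window mean
def toy_denoise(x, step):
    return 0.5 * x + 0.5 * x.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 3))                # 8 views, 3-dim latents
out = sliding_iterative_denoise(z.copy(), toy_denoise)
print(out.std() < z.std())                 # spread shrinks as views are pulled together
```

The real method operates on a spatio-temporal grid of video latents with a learned diffusion denoiser; this sketch only shows how overlapping windows let local denoising passes produce globally consistent results.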

Why it matters?

This matters because creating high-quality, consistent 4D human videos from sparse inputs benefits applications in virtual reality, filmmaking, sports analysis, and other areas that need realistic human motion viewed from arbitrary angles.

Abstract

A sliding iterative denoising process enhances spatio-temporal consistency in 4D diffusion models for high-fidelity view synthesis from sparse-view videos.