SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos
Mingqi Gao, Yunqi Miao, Jungong Han
2025-12-10
Summary
This paper introduces SAM-Body4D, a training-free method for recovering accurate 3D human body meshes from videos. It builds on existing technology that works well on single images, adapting it to handle the challenges of video, such as fast motion and occlusion (parts of the body being hidden from view).
What's the problem?
Current methods for creating 3D human models from videos often struggle with consistency between frames: the reconstructed model can 'jump' around or look unnatural. They also have trouble when parts of a person are blocked from view, for example behind another object. Applying image-based methods to each video frame individually ignores the fact that people generally move smoothly and continuously.
What's the solution?
SAM-Body4D solves this by first identifying and consistently labeling each person throughout the video using a promptable video segmentation model. It then fills in the missing parts of each person's segmentation mask when they are temporarily occluded. Finally, it uses these refined masks to guide the 3D mesh recovery process, producing a smoother and more accurate reconstruction of each person's movements over time. Importantly, all of this happens without retraining the underlying 3D mesh recovery model.
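The three-step pipeline described above can be sketched as follows. This is a minimal illustration only: every function name here is a hypothetical stand-in for the models named in the paper (the video segmentation model, the occlusion-aware refinement, and SAM 3D Body), not their real APIs.

```python
# Hypothetical sketch of the SAM-Body4D pipeline; all functions below are
# illustrative stand-ins, not the paper's actual code or the SAM APIs.

def segment_video(frames):
    """Stand-in for a promptable video segmentation model: returns one
    identity-consistent masklet (a list of per-frame masks) per person.
    Here we fake two tracked people."""
    return {pid: [f"mask_p{pid}_f{i}" for i in range(len(frames))]
            for pid in (0, 1)}

def fill_occlusions(masklet):
    """Stand-in for the occlusion-aware module: recover frames where the
    mask is missing (None) instead of leaving a gap."""
    return [m if m is not None else "recovered" for m in masklet]

def recover_meshes(masklets):
    """Stand-in for mask-guided per-frame mesh recovery: returns one
    full-body mesh trajectory per tracked person."""
    return {pid: [f"mesh({m})" for m in ms] for pid, ms in masklets.items()}

def sam_body4d(frames):
    masklets = segment_video(frames)                                   # step 1
    refined = {pid: fill_occlusions(ms) for pid, ms in masklets.items()}  # step 2
    return recover_meshes(refined)                                     # step 3

frames = ["frame0", "frame1", "frame2"]
trajectories = sam_body4d(frames)  # one mesh trajectory per person
```

Because each stage only consumes the previous stage's output, the image-based mesh model itself never needs retraining; the video-level consistency comes entirely from the masklets.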
Why it matters?
This work is important because it makes 3D human pose and shape estimation from videos much more reliable. This has many potential applications, such as improving motion capture for animation, creating more realistic virtual reality experiences, and helping computers better understand human behavior in real-world scenarios. Because it requires no extra training, it is easy to use with existing tools.
Abstract
Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.
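The padding-based parallel strategy mentioned in the abstract can be illustrated with a small sketch: frames may contain different numbers of people, so per-person inputs are padded to a common count and flagged, allowing one batched inference pass. The shapes and the `pad_batch` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pad_batch(per_frame_people):
    """Illustrative padding for batched multi-human inference.
    per_frame_people: list (one entry per frame) of lists of per-person
    arrays. Returns a dense (frames, max_people, ...) batch plus a boolean
    validity mask so padded slots can be dropped after inference."""
    max_p = max(len(people) for people in per_frame_people)
    feat_shape = per_frame_people[0][0].shape
    batch = np.zeros((len(per_frame_people), max_p) + feat_shape)
    valid = np.zeros((len(per_frame_people), max_p), dtype=bool)
    for f, people in enumerate(per_frame_people):
        for p, arr in enumerate(people):
            batch[f, p] = arr
            valid[f, p] = True
    return batch, valid

# Example: 3 frames containing 2, 1, and 3 people, each a dummy 4x4 crop.
frames = [[np.ones((4, 4))] * n for n in (2, 1, 3)]
batch, valid = pad_batch(frames)
```

Padding to a rectangular batch is a standard trick for running a per-image model over variable-count inputs in parallel; the validity mask ensures padded slots never contaminate the recovered mesh trajectories.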