Human3R: Everyone Everywhere All at Once
Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, Gerard Pons-Moll
2025-10-08
Summary
This paper introduces Human3R, a new computer vision system that can create a complete 3D reconstruction of a scene with people in it, directly from a regular video. It builds a 3D model of the environment and accurately tracks the movements and shapes of multiple people within that scene, all at the same time.
What's the problem?
Existing methods for creating 3D reconstructions from video are often complicated multi-stage pipelines that depend on separate components such as human detection, depth estimation, and SLAM. Each of these components can introduce its own errors, and chaining them together makes the process slow and resource-intensive. These methods also tend to need large amounts of training data and struggle to reconstruct everything at once: the people, the scene, and the camera's movement.
What's the solution?
Human3R replaces that pipeline with a single, streamlined process. It takes a video as input and, in one forward pass, directly outputs 3D models of the people, the scene, and the camera's path. It builds on a previous online reconstruction model called CUT3R and uses a technique called visual prompt tuning, which trains only a small number of new parameters, to add the ability to recognize and reconstruct multiple people while preserving what CUT3R has already learned. Importantly, it runs quickly, at 15 frames per second, and has a low memory footprint of about 8 GB.
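To give a feel for how visual prompt tuning can graft a new readout onto a frozen pretrained model, here is a minimal NumPy sketch: a handful of new, trainable prompt tokens are concatenated with a frame's patch tokens, everything is mixed by a frozen layer, and a small head reads per-person parameters off the prompt slots. All names, sizes, and the toy attention layer are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions (illustrative, not the paper's actual sizes)
D = 64          # token dimension
N_PATCH = 16    # image patch tokens per frame
N_PROMPT = 4    # learnable "human" prompt tokens
SMPLX_DIM = 10  # stand-in size for an SMPL-X parameter vector

# Frozen backbone weights (stand-in for the pretrained CUT3R encoder)
W_frozen = rng.normal(0, 0.02, (D, D))

# The only new trainable parameters: prompt tokens + a readout head
prompt_tokens = rng.normal(0, 0.02, (N_PROMPT, D))   # trainable
W_readout = rng.normal(0, 0.02, (D, SMPLX_DIM))      # trainable

def encode(patch_tokens):
    """One frozen self-attention-style mixing layer over the joint
    [prompt; patch] token sequence (toy stand-in for the backbone)."""
    tokens = np.concatenate([prompt_tokens, patch_tokens], axis=0)
    q = k = v = tokens @ W_frozen
    attn = np.exp(q @ k.T / np.sqrt(D))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def read_smplx(patch_tokens):
    """Read SMPL-X-like parameters directly off the prompt slots."""
    out = encode(patch_tokens)
    return out[:N_PROMPT] @ W_readout   # one vector per prompt slot

frame = rng.normal(0, 1, (N_PATCH, D))   # fake per-frame patch tokens
params = read_smplx(frame)
print(params.shape)   # (4, 10): one parameter vector per prompt slot
```

During training, only `prompt_tokens` and `W_readout` would receive gradients while `W_frozen` stays fixed, which is why this style of tuning is parameter-efficient and cheap enough to run on a single GPU.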
Why it matters?
This work is significant because it provides a much simpler and more efficient way to create 3D reconstructions from video. It performs as well as or better than existing methods, but with less complexity and faster processing. This makes it a strong foundation for future research and could be used in a variety of applications, like virtual reality, robotics, and creating 3D content.
Abstract
We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning, striving to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with the 3D scene, in one stage, at real-time speed (15 FPS) and with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline and be easily extended for downstream applications. Code is available at https://fanegg.github.io/Human3R
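To make the "online" aspect of the abstract concrete, here is a toy streaming loop in the spirit of the recurrent design that models like CUT3R use: frames arrive one at a time, a persistent state is updated, and per-frame outputs (camera pose, per-person parameters) are read from that state. The update rule, shapes, and output names are invented for illustration; this is not the released code.

```python
import numpy as np

D = 32  # state dimension (made up for the sketch)

def update_state(state, frame_feat):
    """Blend the running state with the new frame's features
    (a crude stand-in for a learned recurrent state update)."""
    return 0.9 * state + 0.1 * frame_feat

def readout(state):
    """Per-frame outputs read from the current state: a camera pose
    and a placeholder set of per-person parameters."""
    return {"cam_pose": np.eye(4), "people": np.zeros((2, 10))}

state = np.zeros(D)
outputs = []
for t in range(5):                       # pretend 5-frame video
    frame_feat = np.random.default_rng(t).normal(size=D)
    state = update_state(state, frame_feat)
    outputs.append(readout(state))

print(len(outputs), outputs[0]["cam_pose"].shape)   # 5 (4, 4)
```

The key property this loop illustrates is that memory and compute per frame are constant: nothing is re-run over the whole video when a new frame arrives, which is what makes real-time (15 FPS) operation with a fixed memory budget plausible.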