EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
2026-02-27
Summary
This paper introduces a portable way to record human movement and the surrounding environment using just two moving iPhones, producing a capture pipeline and dataset called EmbodMocap.
What's the problem?
Currently, capturing realistic human motion data for training robots or virtual characters is difficult and expensive, often requiring specialized studios with lots of equipment or people wearing bulky sensors. This limits how much data can be collected in real-world settings, which are much more varied and useful for training AI.
What's the solution?
The researchers developed a system in which two handheld iPhones simultaneously record video and depth information while being moved around. Joint calibration software then aligns the two RGB-D recordings in a single metric coordinate frame, producing a 3D reconstruction of both the person and the surrounding environment without a fixed camera setup or markers on the person. This enables accurate motion capture in everyday places.
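To make the fusion idea concrete, here is a minimal sketch of combining two depth maps into one metric-scale point cloud in a shared world frame. The intrinsics `K` and the camera-to-world poses `T_a`/`T_b` are hypothetical toy values; in the paper's pipeline these poses come from jointly calibrating the dual RGB-D sequences, not from being known in advance.

```python
# Sketch: fuse two RGB-D views into one metric world frame.
# K, T_a, T_b are hypothetical values for illustration only; the paper
# estimates the camera poses via joint calibration of the two iPhones.
import numpy as np

def backproject(depth, K):
    """Lift a depth map (meters) to camera-space 3D points via the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def to_world(points_cam, T_cam2world):
    """Apply a 4x4 rigid transform to camera-space points."""
    homo = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (homo @ T_cam2world.T)[:, :3]

# Toy example: two 2x2 depth maps from two phones.
K = np.array([[500.0, 0.0, 1.0],
              [0.0, 500.0, 1.0],
              [0.0,   0.0, 1.0]])
depth_a = np.full((2, 2), 2.0)       # phone A sees the scene 2 m away
depth_b = np.full((2, 2), 3.0)       # phone B sees it 3 m away
T_a = np.eye(4)                      # phone A defines the world frame
T_b = np.eye(4); T_b[0, 3] = 0.5     # phone B offset 0.5 m along x

cloud = np.vstack([to_world(backproject(depth_a, K), T_a),
                   to_world(backproject(depth_b, K), T_b)])
print(cloud.shape)  # (8, 3): one fused metric-scale point cloud
```

Because both views land in the same metric frame, the reconstructed person and scene geometry stay mutually consistent, which is what the downstream embodied-AI tasks rely on.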
Why it matters?
This research is important because it makes it much easier and cheaper to collect large amounts of data about how people move in the real world. This data can then be used to improve AI systems that need to understand and interact with the physical world, like robots learning to perform tasks or virtual characters behaving more realistically.
Abstract
Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models to output metric-scale, world-space-aligned humans and scenes; physics-based character animation, where we show our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.