WHAC: World-grounded Humans and Cameras

Wanqi Yin, Zhongang Cai, Ruisi Wang, Fanzhou Wang, Chen Wei, Haiyi Mei, Weiye Xiao, Zhitao Yang, Qingping Sun, Atsushi Yamashita, Ziwei Liu, Lei Yang

2025-02-24

Summary

This paper introduces WHAC, a new system that can figure out where people and cameras are moving in 3D space, at real-world scale, just by looking at an ordinary video, which is very hard to do accurately.

What's the problem?

It's tough to tell exactly where people and cameras are moving in the real world from a 2D video alone. Current methods struggle to recover the correct real-world scale, and many break down when both the person and the camera are moving at the same time.

What's the solution?

The researchers created WHAC, which builds on two key observations. First, existing methods that estimate a person's body pose from the camera's view can already recover how far away the person is in real units. Second, the way people move (for example, typical walking speeds and stride lengths) provides clues about real-world distances. WHAC combines these two signals to track both people and cameras in 3D space at the correct scale, without relying on traditional optimization techniques. They also built a new synthetic video dataset called WHAC-A-Mole to test their system.
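To make the "humans as a metric ruler" idea concrete, here is a toy 1-D sketch of scale recovery: a camera trajectory from visual odometry is only known up to an unknown scale, the person's distance from the camera is known in metres (from camera-frame pose estimation), and a motion prior predicts the person's metric displacement per frame. A least-squares fit then recovers the missing scale. The function name, the 1-D setup, and the closed-form fit are all illustrative assumptions for this sketch, not WHAC's actual implementation.

```python
import numpy as np

def estimate_metric_scale(cam_traj, human_cam, motion_metric):
    """Toy 1-D recovery of the unknown scale of a camera trajectory.

    cam_traj:      (T,) camera positions along one axis, known only up to
                   scale (as from monocular visual odometry / SLAM).
    human_cam:     (T,) human root positions in the camera frame, in metres
                   (camera-frame pose estimators recover absolute depth).
    motion_metric: (T-1,) metric per-frame human displacements predicted by
                   a motion prior ("humans as a metric ruler").
    Returns the scale s that best explains the human's world motion.
    """
    d_cam = np.diff(cam_traj)    # up-to-scale camera motion per frame
    d_hum = np.diff(human_cam)   # metric human motion relative to the camera
    # In this 1-D toy (no rotation), the human's world displacement is
    # s * d_cam + d_hum. Fit s so it matches the motion prior, in the
    # least-squares sense.
    return np.dot(d_cam, motion_metric - d_hum) / np.dot(d_cam, d_cam)
```

With the scale in hand, both the camera trajectory and the human trajectory can be expressed in consistent world coordinates, which is the world-grounded output the paper is after.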

Why it matters?

This matters because it could help improve things like virtual reality, robotics, and sports analysis. By understanding how people and cameras move in 3D space from just a normal video, we can create more realistic computer graphics, better video games, and smarter robots that can interact with people more naturally.

Abstract

Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly, by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. Firstly, camera-frame SMPL-X estimation methods readily recover absolute human depth. Secondly, human motions inherently provide absolute spatial cues. By integrating these insights, we introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks highlight the superiority and efficacy of our framework. We will make the code and dataset publicly available.