WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, Shanghang Zhang

2025-10-09

Summary

This paper introduces WristWorld, a model that generates realistic videos of a robot's hand from the perspective of a wrist-mounted camera, using only footage recorded from other (anchor) camera angles.

What's the problem?

Robots learning to manipulate objects need to 'see' what they're doing from different viewpoints, and a wrist-mounted camera provides a particularly useful, close-up view of hand-object interactions. However, large-scale robot datasets rarely include wrist-camera recordings, so models trained on them struggle to understand and predict what happens when the robot interacts with objects up close. Existing world models can't fill this gap: they require a wrist-view first frame to start from, so they cannot generate wrist-view videos from standard camera angles alone.

What's the solution?

The researchers built WristWorld, which works in two stages. First, a reconstruction stage estimates what the scene would look like from the wrist camera, building a geometrically consistent understanding of the objects and the robot's hand: it extends an existing visual geometry model called VGGT and adds a Spatial Projection Consistency (SPC) loss to keep the estimated wrist-view pose and point clouds geometrically accurate. Second, a generation stage uses this reconstructed perspective to synthesize a temporally coherent video sequence, showing the robot's hand interacting with objects as if the camera were actually mounted on the wrist.
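The paper does not give the exact form of the SPC loss here, but the idea of penalizing geometric disagreement under an estimated camera pose can be sketched with a toy reprojection error. The following minimal example (all function names, the pinhole-projection setup, and the use of a reference pose are illustrative assumptions, not the paper's implementation) projects shared 3D points through an estimated wrist pose and a reference pose and measures how far the resulting pixels disagree:

```python
import numpy as np

def project(points_3d, extrinsic, intrinsic):
    """Project (N, 3) world points through a 4x4 extrinsic and 3x3 pinhole intrinsic."""
    pts_h = np.hstack([points_3d, np.ones((len(points_3d), 1))])  # homogeneous coords
    cam = (extrinsic @ pts_h.T).T[:, :3]                          # world -> camera frame
    pix = (intrinsic @ cam.T).T
    return pix[:, :2] / pix[:, 2:3]                               # perspective divide

def spc_loss(points_3d, est_wrist_pose, ref_pose, intrinsic):
    """Toy stand-in for a spatial projection consistency term: mean pixel
    reprojection error between the estimated and reference wrist poses."""
    est = project(points_3d, est_wrist_pose, intrinsic)
    ref = project(points_3d, ref_pose, intrinsic)
    return float(np.mean(np.linalg.norm(est - ref, axis=1)))

# A correctly estimated pose projects the points to the same pixels (zero loss);
# a perturbed pose produces a positive penalty.
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.3, 3.0]])
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
identity = np.eye(4)
shifted = np.eye(4)
shifted[0, 3] = 0.1  # 10 cm translation error along x
```

In the actual model this kind of penalty would be one term in a training objective, driving the estimated wrist-view pose toward agreement with the reconstructed 4D point clouds.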

Why it matters?

This is important because WristWorld lets robots learn more effectively even when wrist-view data is scarce. By generating realistic wrist-view training videos, it improves task completion rates and narrows the performance gap between policies that use standard anchor views and those that use wrist views.

Abstract

Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.