H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou

2025-12-12

Summary

This paper presents a new way to teach robots how to manipulate objects by simply showing them videos of humans doing the same tasks, without needing any task-specific robot training data.

What's the problem?

Traditionally, getting robots to perform complex manipulation tasks requires a lot of time and effort collecting data of the robot actually *doing* those tasks. This is slow and expensive. The core issue is the 'embodiment gap' – robots and humans are physically different, so it's hard to directly transfer knowledge from human videos to robot actions. Existing methods often produce unrealistic or jerky robot movements.

What's the solution?

The researchers developed a system that translates human action videos into realistic robot action videos. It works by first 'cleaning up' the human videos, removing the person and adding a visual guide showing where the robot's 'hand' (gripper) should be. Then, a powerful video generation model learns to insert a robot arm into the scene, mimicking the human's movements. Importantly, this system only needs videos of robots moving around generally – it doesn't need videos of robots performing the *specific* tasks it's learning. They also used a technique called 'in-context learning' to make the generated robot movements smoother and more natural.
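The "visual guide" mentioned above is described in the paper as a marker and arrow indicating the gripper's position and orientation. As a rough illustration, the sketch below overlays such a cue (a filled disc plus a short direction ray) on a video frame; the function name, rendering style, and frame representation are assumptions for demonstration, not the paper's actual implementation.

```python
import math

def overlay_gripper_cue(frame, position, angle, marker_radius=2, ray_len=6,
                        color=(255, 0, 0)):
    """Overlay a simple gripper cue (disc + direction ray) on one frame.

    Illustrative sketch only: H2R-Grounder's exact cue rendering is not
    specified at this level of detail.

    frame:    H x W list of lists of (R, G, B) tuples
    position: (row, col) of the gripper
    angle:    gripper orientation in radians
    """
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    h, w = len(out), len(out[0])
    r0, c0 = position
    # Filled disc marking the gripper's position.
    for r in range(h):
        for c in range(w):
            if (r - r0) ** 2 + (c - c0) ** 2 <= marker_radius ** 2:
                out[r][c] = color
    # Short ray of pixels indicating the gripper's orientation.
    for t in range(marker_radius, marker_radius + ray_len):
        r = round(r0 + t * math.sin(angle))
        c = round(c0 + t * math.cos(angle))
        if 0 <= r < h and 0 <= c < w:
            out[r][c] = color
    return out
```

At training time a cue like this would be drawn from the robot's known gripper pose; at test time, from the human hand pose estimated in the source video, giving the generative model the same conditioning signal in both cases.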

Why it matters?

This research is important because it opens the door to robots learning a much wider range of skills simply by watching us. It reduces the need for expensive and time-consuming robot-specific data collection, making it easier to deploy robots in everyday environments and have them perform useful tasks. The ability to learn from unlabeled human videos is a big step towards more adaptable and intelligent robots.

Abstract

Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting out the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and to leverage its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/