DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
Hyeonwoo Kim, Jeonghwan Kim, Kyungwon Cho, Hanbyul Joo
2026-04-23
Summary
This paper introduces DeVI, a method for teaching robots to interact with objects dexterously, the way a human would. Instead of relying on recordings of real people, it trains the robot on computer-generated (synthetic) videos.
What's the problem?
Teaching robots to manipulate objects with dexterity is hard. While computer-generated videos can *show* complex interactions, they aren't physically accurate, and they are purely 2D, carrying no explicit 3D information. This makes it difficult for a robot to learn directly from these videos and actually *perform* the actions in a physically simulated 3D world. Existing methods often require precise 3D demonstration data, which is expensive to capture and limits what the robot can do.
What's the solution?
DeVI solves this by using these computer-generated videos, with a key trick: it combines tracking the human's motion in 3D with tracking the object in 2D image space, where tracking stays reliable even when the video is imperfect. This 'hybrid tracking reward' tells the robot how well it is reproducing the demonstrated interaction despite the video's flaws. Because the robot needs only the generated video itself to learn, it can handle new objects and tasks it has never seen, without any extra 3D data.
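To make the idea concrete, here is a minimal sketch of what such a hybrid tracking reward could look like. This is an illustration, not the paper's actual formulation: the exponentiated-error form (common in physics-based character imitation, e.g., DeepMimic-style rewards), the weights, and all function and parameter names are assumptions.

```python
import numpy as np

def hybrid_tracking_reward(
    sim_joints_3d: np.ndarray,   # (J, 3) simulated humanoid joint positions
    ref_joints_3d: np.ndarray,   # (J, 3) 3D human pose recovered from the generated video
    sim_obj_kps_2d: np.ndarray,  # (K, 2) simulated object keypoints projected into the camera
    ref_obj_kps_2d: np.ndarray,  # (K, 2) 2D object keypoints tracked in the video
    w_human: float = 0.6,        # hypothetical weighting between the two terms
    w_obj: float = 0.4,
    k_human: float = 10.0,       # hypothetical error sensitivities
    k_obj: float = 50.0,
) -> float:
    """Per-frame reward combining 3D human tracking with 2D object tracking.

    The human term is supervised in 3D, while the object term stays in 2D
    image space, sidestepping unreliable 3D object reconstruction from
    imperfect synthetic video.
    """
    # Mean squared position error over joints / keypoints.
    human_err = float(np.mean(np.sum((sim_joints_3d - ref_joints_3d) ** 2, axis=-1)))
    obj_err = float(np.mean(np.sum((sim_obj_kps_2d - ref_obj_kps_2d) ** 2, axis=-1)))

    # Exponentiated errors keep each term in (0, 1], as in DeepMimic-style rewards.
    r_human = float(np.exp(-k_human * human_err))
    r_obj = float(np.exp(-k_obj * obj_err))
    return w_human * r_human + w_obj * r_obj
```

One plausible reason for this split, suggested by the abstract's emphasis on "robust 2D object tracking": the reward never has to depend on a noisy 3D object pose lifted from an imperfect video, while the human's 3D pose can be recovered comparatively reliably and tracked in full 3D.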
Why it matters?
This work is important because it allows robots to learn complex manipulation skills from video data that can be generated cheaply and at scale. It's a step towards robots that can adapt to new situations and objects without needing to be specifically programmed for each one, and it shows that generated video can be a powerful tool for planning robot movements.
Abstract
Recent advances in video generative models enable the synthesis of realistic human-object interaction (HOI) videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.