Whole-Body Conditioned Egocentric Video Prediction
Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
2025-06-27
Summary
This paper introduces a model that predicts how a video taken from a person's perspective (egocentric video) will unfold in the future, based on that person's whole-body movements.
What's the problem?
The problem is that it is hard for computers to predict how the world will change in a first-person video: a model must understand both the person's complex, high-dimensional body movements and how those movements reshape what the camera sees over time.
What's the solution?
The researchers trained a model on real-world egocentric videos paired with detailed body-pose data, teaching it to predict future video frames one step at a time by modeling the relationship between the person's full-body actions and the changing first-person view. They also evaluated the model at different levels of difficulty, from simple atomic movements to long-horizon prediction and planning, as sketched below.
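To make the idea of pose-conditioned, step-by-step prediction concrete, here is a minimal sketch in PyTorch. It is not the paper's implementation (the paper uses a conditional diffusion transformer); the class name, feature dimensions, and the simple MLP fusion are illustrative assumptions only.

```python
# Minimal sketch (not the authors' code): conditioning next-frame prediction
# on whole-body pose actions. Names and dimensions are hypothetical.
import torch
import torch.nn as nn

class PoseConditionedPredictor(nn.Module):
    def __init__(self, frame_dim=256, pose_dim=48, hidden_dim=512):
        super().__init__()
        # Encode the current frame latent and the body-pose action separately.
        self.frame_encoder = nn.Linear(frame_dim, hidden_dim)
        self.pose_encoder = nn.Linear(pose_dim, hidden_dim)
        # Fuse frame context with the pose action and predict the next frame latent.
        self.fuse = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, frame_dim),
        )

    def forward(self, frame_latent, pose_action):
        h = torch.cat([self.frame_encoder(frame_latent),
                       self.pose_encoder(pose_action)], dim=-1)
        return self.fuse(h)

# Step-by-step (autoregressive) rollout: each predicted frame latent becomes
# the context for the next step, driven by the sequence of pose actions.
model = PoseConditionedPredictor()
frame = torch.randn(1, 256)    # latent of the current egocentric frame
poses = torch.randn(1, 8, 48)  # eight future whole-body pose actions
rollout = []
for t in range(poses.shape[1]):
    frame = model(frame, poses[:, t])
    rollout.append(frame)
```

The key design point this sketch captures is that every prediction step consumes both the current view and the body action, so errors in understanding the action propagate into all later frames of the rollout.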
Why it matters?
This matters because accurately predicting future first-person views can help robots and AI systems better understand, anticipate, and interact with the world, with potential applications in virtual reality, robotics, and assistive technology.
Abstract
A model trained on real-world egocentric video paired with body pose predicts future video from human whole-body actions using an auto-regressive conditional diffusion transformer, and is evaluated with a hierarchical protocol of increasingly challenging tasks.
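The abstract names an auto-regressive conditional diffusion transformer; the sketch below shows only the conditional-denoising idea behind such a model, using a toy MLP in place of a transformer. The class name, dimensions, and the flat concatenation of conditioning signals are assumptions for illustration, not the paper's design.

```python
# Minimal sketch (hypothetical names, not the paper's implementation) of one
# denoising step in an action-conditioned diffusion model: the network predicts
# the noise added to the next frame's latent, given the noisy latent, the
# diffusion timestep, past-frame context, and the whole-body pose action.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, latent_dim=256, pose_dim=48, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + latent_dim + pose_dim + 1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, noisy_latent, timestep, context_latent, pose_action):
        # Concatenate all conditioning signals; a transformer would attend over
        # them instead, but the conditioning principle is the same.
        x = torch.cat([noisy_latent, context_latent, pose_action, timestep], dim=-1)
        return self.net(x)  # predicted noise

denoiser = ConditionalDenoiser()
noisy = torch.randn(1, 256)      # noisy latent of the frame being generated
ctx = torch.randn(1, 256)        # context from previously generated frames
pose = torch.randn(1, 48)        # whole-body pose action
t = torch.full((1, 1), 0.5)      # diffusion timestep
pred_noise = denoiser(noisy, t, ctx, pose)
```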