PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen
2025-12-22
Summary
This research focuses on making robots better at understanding the world from their own perspective, as humans do, so they can perform complex tasks. It's about giving robots 'physical intelligence': the ability to figure out how things change when they interact with them.
What's the problem?
Current robot 'brains' (vision-language models) mostly learn from videos filmed from a third-person, outside-observer viewpoint, not from the robot's own 'eyes'. This is a problem because seeing the world from a third-person view is very different from experiencing it first-hand as the robot does. Collecting enough videos of robots actually *doing* things is expensive and lacks variety, but there is a huge amount of video of people doing everyday tasks from their own viewpoint. The challenge is how to turn these human videos into effective training material for robots.
What's the solution?
The researchers created a pipeline called 'Egocentric2Embodiment' that takes ordinary first-person videos (like someone cooking or assembling something) and turns them into structured question-and-answer training data for robots. The data is designed to teach *why* things happen and to keep the steps of a task in the right order. Using this pipeline, they built a large dataset, E2E-3M, and then trained a robot 'brain', called PhysBrain, on it; a sketch of what one such training example might look like is shown below.
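To make the idea concrete, here is a minimal sketch of what a single training example produced by such a pipeline might contain: a question about a first-person clip, a grounded answer, and the video frames that support it. The field names and example values are assumptions for illustration; the paper does not publish the exact schema.

```python
# Illustrative sketch of one Egocentric2Embodiment-style VQA example.
# All field names and values are assumptions, not the dataset's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class EgoVQASample:
    """One schema-driven VQA example derived from a first-person video clip."""
    clip_id: str                  # source egocentric video clip
    question: str                 # question about the observed interaction
    answer: str                   # answer that must be supported by the clip
    evidence_frames: List[int]    # frame indices that ground the answer
    level: str                    # e.g. "state-change", "causal", "planning"


# Hypothetical example: a causal question about a cooking clip.
sample = EgoVQASample(
    clip_id="ego_clip_000123",
    question="Why does the person tilt the pan after adding oil?",
    answer="To spread the oil evenly across the pan before adding the vegetables.",
    evidence_frames=[112, 118, 124],
    level="causal",
)
print(sample.question)
```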
Why does it matter?
This work is important because it lets robots learn efficiently from readily available human videos instead of relying only on costly robot-collected data. By understanding the world from their own perspective, robots can plan and execute tasks more successfully and adapt more readily to new situations. The results showed a significant improvement in the robot's ability to plan and carry out tasks, demonstrating that learning from human first-person experience can be a powerful tool for robotics.
Abstract
Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most vision-language models (VLMs) are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven visual question answering (VQA) supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient vision-language-action (VLA) fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
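As an illustration of how "enforced evidence grounding and temporal consistency" might be operationalized when filtering generated samples, the sketch below checks that each answer cites frames inside its source clip and that evidence for successive plan steps moves forward in time. The function names and rules are assumptions for illustration, not the authors' published pipeline.

```python
# Sketch of two plausible sample-level filters: evidence grounding and
# temporal consistency. The exact rules used in the paper are not specified;
# this is an assumed, simplified version.
from typing import List


def evidence_grounded(evidence: List[int], clip_len: int) -> bool:
    """Check that every cited frame index falls inside the source clip."""
    return bool(evidence) and all(0 <= f < clip_len for f in evidence)


def temporally_consistent(step_evidence: List[List[int]]) -> bool:
    """Check that evidence for successive plan steps moves forward in time."""
    starts = [min(e) for e in step_evidence]
    return all(a <= b for a, b in zip(starts, starts[1:]))


# Hypothetical clip with 300 frames and three ordered plan steps.
steps = [[10, 15], [120, 130], [250]]
assert all(evidence_grounded(e, clip_len=300) for e in steps)
assert temporally_consistent(steps)
print("sample kept")
```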