EgoTwin: Dreaming Body and View in First Person
Jingqiao Xiu, Fangzhou Hong, Yicong Li, Mengze Li, Wentao Wang, Sirui Han, Liang Pan, Ziwei Liu
2025-08-25
Summary
This paper focuses on creating realistic videos from a first-person perspective, like what you'd see through a head-mounted camera, while also generating the full-body motion of the person wearing it. The point is to produce the video *and* the body movements together, so the camera view and the motion stay consistent with each other.
What's the problem?
Generating videos from a first-person view is hard because two things have to work together perfectly. First, the camera's movement in the video needs to match the head movement of the person 'wearing' the camera. Second, the person's actions should logically cause what you see in the following frames; for example, if they reach out, you should see their hand touch something. Existing methods are good at making videos from an outside perspective, but they struggle with this coordination in first-person views.
What's the solution?
The researchers developed a system called EgoTwin, built on a powerful type of generative AI called a diffusion transformer. EgoTwin represents human motion relative to the head: it anchors the body movements to the head joint, which is also where the camera sits. It also lets the video and the motion 'talk' to each other during generation through the model's attention operations, so that actions produce matching visual changes and vice versa. Finally, the researchers collected a large real-world dataset of videos paired with motion data to train and test the system.
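To give a feel for the head-centric idea, here is a minimal sketch of how global body-joint positions could be re-expressed in the head's coordinate frame. The tensor shapes, the default joint index, and the function name are illustrative assumptions, not the paper's actual code.

```python
import torch

def to_head_centric(joints_xyz: torch.Tensor,
                    head_rot: torch.Tensor,
                    head_index: int = 15) -> torch.Tensor:
    """Re-express global joint positions in the head's local frame.

    joints_xyz: (T, J, 3) global joint positions over T frames.
    head_rot:   (T, 3, 3) global head orientation per frame.
    head_index: index of the head joint (an assumption; depends on the
                skeleton convention used).

    Returns a (T, J, 3) tensor of joint positions anchored to the head,
    i.e. expressed in the same frame the egocentric camera lives in.
    """
    head_pos = joints_xyz[:, head_index:head_index + 1, :]   # (T, 1, 3)
    offsets = joints_xyz - head_pos                          # head becomes the origin
    # Rotate each frame's offsets into the head's local coordinate frame (R^T x).
    return torch.einsum('tij,tnj->tni', head_rot.transpose(1, 2), offsets)
```

Anchoring the motion to the head joint this way puts the body movements in the same reference frame as the egocentric camera, which is what makes it natural to tie the generated camera trajectory to the head trajectory.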
Why it matters?
This work is important because it brings us closer to creating truly immersive virtual reality experiences or realistic simulations. If we can automatically generate first-person videos with accurate and natural movements, it opens up possibilities for training, entertainment, and even helping people with disabilities by creating assistive technologies.
Abstract
While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
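As one possible reading of the interaction mechanism "within attention operations", the sketch below runs joint attention over concatenated video and motion tokens with a block mask that only allows cross-modal attention to the current and earlier time steps. The module name, the assumption that video and motion tokens are time-aligned one-to-one, and the specific masking rule are guesses for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoMotionJointAttention(nn.Module):
    """Hypothetical joint attention over video and motion tokens.

    Video and motion tokens are concatenated into one sequence, and a
    boolean mask controls which modality/time pairs may attend to each
    other. The rule below (a token at step i may see the other modality
    only up to step i) is an illustrative guess at how a causal
    video-motion interplay could be enforced inside attention.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def build_mask(t_video: int, t_motion: int) -> torch.Tensor:
        n = t_video + t_motion
        allowed = torch.zeros(n, n, dtype=torch.bool)
        # Within-modality: full self-attention (assumption).
        allowed[:t_video, :t_video] = True
        allowed[t_video:, t_video:] = True
        # Cross-modality: step i attends to the other modality's steps <= i.
        for i in range(t_motion):
            allowed[t_video + i, :t_video] = torch.arange(t_video) <= i
        for i in range(t_video):
            allowed[i, t_video:] = torch.arange(t_motion) <= i
        # nn.MultiheadAttention expects True where attention is *blocked*.
        return ~allowed

    def forward(self, video_tokens: torch.Tensor, motion_tokens: torch.Tensor):
        # video_tokens: (B, Tv, D), motion_tokens: (B, Tm, D)
        x = torch.cat([video_tokens, motion_tokens], dim=1)
        mask = self.build_mask(video_tokens.size(1), motion_tokens.size(1))
        out, _ = self.attn(x, x, x, attn_mask=mask.to(x.device))
        return out[:, :video_tokens.size(1)], out[:, video_tokens.size(1):]
```

The intent of such a mask is to let generated motion depend on what has already been observed in the video and to let generated frames depend on the motion that produced them, mirroring the causal interplay described in the abstract.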