
EgoX: Egocentric Video Generation from a Single Exocentric Video

Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo

2025-12-15


Summary

This paper introduces a new method, called EgoX, for automatically creating first-person (egocentric) videos from a single standard third-person (exocentric) video. In other words, it re-renders the footage from a new viewpoint so it feels as if you are seeing the scene through the eyes of someone in it.

What's the problem?

Taking a video filmed from a normal perspective and turning it into a first-person view is really hard. The camera angles can change dramatically, and there's often very little overlap between what's visible in the original video and what *should* be visible from the new, first-person perspective. The system needs to fill in the missing parts realistically and make sure everything looks geometrically correct – meaning objects don't warp or appear in impossible positions.

What's the solution?

The researchers start from a powerful type of artificial intelligence called a video diffusion model, which has already learned a great deal about how videos look and move. They fine-tuned this model using a technique called LoRA, which adapts it by training only a small set of extra parameters instead of retraining the whole model. They also developed a unified way to feed the model information from both the original third-person video and first-person guidance, placing the two side by side and stacking them as extra input channels, as illustrated in the sketch below. Finally, they added a geometry-guided attention mechanism that helps the AI focus on the parts of the scene that are actually relevant to the new viewpoint, keeping the result realistic and geometrically consistent.
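
To make the conditioning idea concrete, here is a minimal PyTorch sketch that concatenates an exocentric video latent and an egocentric prior with the noisy egocentric latent along the width and channel dimensions. The tensor shapes and all names (`build_condition`, `ego_noisy`, `exo_latent`, `ego_prior`) are assumptions for illustration, not the paper's actual code.

```python
import torch

def build_condition(ego_noisy, exo_latent, ego_prior):
    """Hypothetical sketch of the unified conditioning described in the paper:
    exocentric and egocentric priors are merged with the noisy egocentric
    latent by concatenating along the width and channel dimensions.

    All tensors are assumed to be video latents of shape (B, C, T, H, W).
    """
    # Width-wise concatenation: place the exocentric latent next to the
    # noisy egocentric latent so attention can look across both views.
    widened = torch.cat([exo_latent, ego_noisy], dim=-1)          # (B, C, T, H, 2W)

    # Channel-wise concatenation: stack the egocentric prior (padded to the
    # same width) onto the channel axis as extra guidance.
    prior_widened = torch.cat([torch.zeros_like(exo_latent), ego_prior], dim=-1)
    conditioned = torch.cat([widened, prior_widened], dim=1)      # (B, 2C, T, H, 2W)
    return conditioned

# Toy example with made-up latent dimensions
B, C, T, H, W = 1, 16, 8, 32, 32
x = build_condition(torch.randn(B, C, T, H, W),
                    torch.randn(B, C, T, H, W),
                    torch.randn(B, C, T, H, W))
print(x.shape)  # torch.Size([1, 32, 8, 32, 64])
```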

Why it matters?

This work is important because it could lead to more immersive virtual reality experiences. Imagine being able to experience any video as if you were actually there! It also has potential applications in robotics and self-driving cars, where understanding the world from a first-person perspective is crucial.

Abstract

Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
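
To illustrate what a geometry-guided self-attention step could look like, the sketch below biases attention scores with a precomputed geometric relevance mask so each query attends mainly to spatially relevant key positions. The mask construction, shapes, and names (`geometry_guided_attention`, `visibility_mask`) are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def geometry_guided_attention(q, k, v, visibility_mask, bias=-1e4):
    """Illustrative sketch of geometry-guided self-attention.

    q, k, v:          (B, heads, N, d) token tensors
    visibility_mask:  (B, N, N) boolean, True where key position j is
                      geometrically relevant to query position i
                      (e.g. it projects near it in the target view).
    """
    # Scaled dot-product attention scores: (B, heads, N, N)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    # Add a large negative bias to geometrically irrelevant key positions
    # so attention concentrates on spatially consistent regions.
    scores = scores + (~visibility_mask).unsqueeze(1) * bias
    attn = F.softmax(scores, dim=-1)
    return attn @ v

# Toy usage with random tokens and a random relevance mask
B, heads, N, d = 1, 4, 64, 32
out = geometry_guided_attention(torch.randn(B, heads, N, d),
                                torch.randn(B, heads, N, d),
                                torch.randn(B, heads, N, d),
                                torch.rand(B, N, N) > 0.5)
print(out.shape)  # torch.Size([1, 4, 64, 32])
```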