
EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing

Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, Willi Menapace

2025-12-09


Summary

This research focuses on making it possible to edit videos taken from someone's point of view, like those you'd get from camera glasses or a GoPro, using simple text instructions. It's about building tools for interactive augmented reality applications where you can change videos in real time.

What's the problem?

Editing videos from a first-person perspective is much harder than editing regular videos. These 'egocentric' videos have a lot of shaky movement as the person moves around, and they often show hands interacting with objects. Existing video editing AI handles both of these poorly, and current editing methods are too slow to work in real time, making it difficult to interact with the edits as they happen.

What's the solution?

The researchers created a whole system to tackle this. First, they built a new dataset called EgoEditData, specifically designed for training AI to edit these kinds of videos while keeping the person's hands intact in the edited results. Then, they developed an AI editor called EgoEdit that can understand text instructions and edit the video quickly enough for real-time use on a single standard graphics card. Finally, they created a set of tests, EgoEditBench, to measure how well the editor follows instructions, keeps hands and interactions looking natural, and stays stable even with lots of camera movement.
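
To give a concrete feel for what "real-time streaming" editing means in practice, here is a minimal, hypothetical sketch of a frame-by-frame editing loop. The `PlaceholderEditor` class and every name in it are assumptions for illustration only; they are not EgoEdit's actual interface, which is not described in this summary.

```python
# Hypothetical sketch of a streaming, instruction-conditioned editing loop.
# PlaceholderEditor is a stand-in: it returns frames unchanged where a real
# instruction-following model would produce the edited frame.
import time
import cv2  # pip install opencv-python


class PlaceholderEditor:
    """Stand-in for a streaming video editor; not EgoEdit's actual API."""

    def __init__(self, instruction):
        self.instruction = instruction
        self.state = None  # would hold temporal context carried across frames

    def edit_frame(self, frame):
        # A real model would condition on self.instruction and self.state here,
        # editing each frame causally so it can be shown with low latency.
        return frame


def stream_edit(video_source=0, instruction="replace the mug with a teapot", fps=24.0):
    editor = PlaceholderEditor(instruction)
    cap = cv2.VideoCapture(video_source)
    frame_budget = 1.0 / fps
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        start = time.perf_counter()
        edited = editor.edit_frame(frame)  # per-frame edit keeps interaction live
        cv2.imshow("edited", edited)
        # Pace the loop so output roughly tracks the camera's frame rate.
        leftover = frame_budget - (time.perf_counter() - start)
        if leftover > 0:
            time.sleep(leftover)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    stream_edit()
```

The point of the sketch is that each frame is edited and displayed as soon as it arrives, instead of processing the whole clip offline, which is what makes interactive use possible.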

Why it matters?

This work is important because it opens the door to more realistic and interactive augmented reality experiences. Imagine being able to edit a video of yourself building something, and then virtually 're-doing' steps with different objects, all in real time. By making egocentric video editing more effective, this research brings us closer to that kind of future, and the tools they created will help other researchers build on this work.

Abstract

We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges, including rapid egomotion and frequent hand-object interactions, that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a manually curated dataset designed specifically for egocentric editing scenarios, featuring rich hand-object interactions while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks, where existing methods struggle, while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit
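
For intuition on one of the benchmark axes, temporal stability under egomotion, below is a generic sketch of how flicker between consecutive edited frames can be quantified with a flow-warping error. This illustrates a standard style of metric only; EgoEditBench's actual evaluation protocol is not detailed here.

```python
# Generic flow-warping error between consecutive frames: warp frame t onto
# frame t+1 with dense optical flow and measure the remaining photometric error.
# Lower values suggest the edited video flickers less under camera motion.
import cv2
import numpy as np


def warping_error(frames):
    """Mean photometric error after flow-warping each frame onto the next.

    `frames` is a list of same-sized uint8 BGR images (the edited video).
    """
    errors = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
        # Dense flow from curr to prev, so we can backward-warp prev onto curr's grid.
        flow = cv2.calcOpticalFlowFarneback(
            curr_gray, prev_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        h, w = curr_gray.shape
        grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (grid_x + flow[..., 0]).astype(np.float32)
        map_y = (grid_y + flow[..., 1]).astype(np.float32)
        warped_prev = cv2.remap(prev, map_x, map_y, cv2.INTER_LINEAR)
        diff = np.abs(warped_prev.astype(np.float32) - curr.astype(np.float32))
        errors.append(float(diff.mean()))
    return float(np.mean(errors))
```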