EgoLCD: Egocentric Video Generation with Long Context Diffusion
Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang
2025-12-05
Summary
This paper introduces a new system called EgoLCD designed to create realistic, long videos from a first-person perspective, like what you'd see if a camera were attached to someone's head.
What's the problem?
Creating these long videos is really hard because the system needs to 'remember' what objects look like and how things work over a long period of time. Existing methods often struggle with this, producing videos where objects change appearance or the scene stops making sense as time goes on – it's as if the system 'forgets' what it was doing.
What's the solution?
EgoLCD tackles this by combining two memory systems: a long-term memory (a sparse cache) that reliably stores important details about the scene, and an attention-based short-term memory that focuses on what's happening right now. On top of that, they fine-tune the system with lightweight adapters (LoRA), add a special training objective that makes sure the memory is used consistently, and give the system clear, structured instructions about how events should unfold over time.
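The two-memory idea above can be sketched in a toy form: keep a short window of the most recent frames verbatim, and keep only a few query-relevant older frames as the sparse long-term cache. This is a minimal illustration, not the paper's implementation; the function name, the selection rule (dot-product relevance), and all parameters are assumptions for the sketch.

```python
import numpy as np

def select_memory(keys, values, query, short_window=4, long_keep=2):
    """Toy sketch of combined memory (illustrative, not EgoLCD's actual code):
    - short-term memory: the last `short_window` frames, kept verbatim;
    - long-term sparse cache: only the `long_keep` older frames whose keys
      score highest against the current query."""
    n = len(keys)
    recent = list(range(max(0, n - short_window), n))   # short-term window
    older = list(range(0, max(0, n - short_window)))    # long-term candidates
    # keep only the most query-relevant older entries (sparsity)
    scores = [float(query @ keys[i]) for i in older]
    kept = [i for _, i in sorted(zip(scores, older), reverse=True)[:long_keep]]
    idx = sorted(kept) + recent
    return [keys[i] for i in idx], [values[i] for i in idx], idx
```

The point of the sketch is the budget: attention only ever sees `short_window + long_keep` entries, so cost stays bounded no matter how long the video grows, while the sparse slots preserve globally important context.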
Why does it matter?
This work is a big step forward for creating AI that can understand and interact with the world like humans do. Being able to generate realistic, long-term videos is crucial for training AI agents to perform tasks in the real world, like cooking or fixing things, and it helps build more sophisticated 'world models' for artificial intelligence.
Abstract
Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
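The abstract does not define the Memory Regulation Loss, but its stated goal is "consistent memory usage". One way to convey that idea is a penalty on how much the attention mass over memory slots fluctuates from one generation step to the next; the sketch below is purely illustrative and should not be read as the paper's actual objective.

```python
import numpy as np

def memory_regulation_loss(attn_weights):
    """Illustrative stand-in for a memory-consistency objective (assumed form,
    not EgoLCD's published loss): penalize step-to-step fluctuation of the
    attention distribution over memory slots.

    attn_weights: (steps, slots) array; each row is an attention
    distribution over the cached memory and sums to 1."""
    diffs = np.diff(attn_weights, axis=0)  # change per slot between steps
    return float(np.mean(diffs ** 2))      # mean squared fluctuation
```

Under this toy definition, attention that reads the same memory slots at every step incurs zero loss, while attention that jumps between slots is penalized, which is one plausible reading of "enforces consistent memory usage".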