EgoSim: Egocentric World Simulator for Embodied Interaction Generation
Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, Xudong Xu
2026-04-03
Summary
This paper introduces EgoSim, a new computer program that creates realistic videos from a first-person perspective, like what you'd see if you were wearing a camera. It's designed to simulate how someone interacts with the world around them, and importantly, it remembers what happens in the simulation, so the simulated world stays consistent from one interaction to the next.
What's the problem?
Current programs that try to do this either don't accurately represent the 3D structure of the environment, so things look wrong when the viewpoint moves, or they treat the world as unchanging. Imagine trying to simulate building a tower with blocks: existing simulators can't really handle the blocks changing position as you stack them. They struggle with interactions that unfold over time and permanently change the scene.
What's the solution?
The researchers solved this by building a simulator that keeps track of a 3D model of the environment and updates it as interactions happen. They created a system that uses information about shapes, actions, and how things interact to generate realistic videos. To get enough data to train this system, they developed a way to automatically extract the needed information from lots of real-world videos taken with regular cameras, and they also built a simple system for collecting new data using ordinary smartphones.
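The closed-loop idea described above can be sketched in a few lines: a persistent world state is read to render each observation, and each interaction writes its effect back into that state so later steps see the updated world. This is a minimal illustrative sketch, not the paper's implementation; `WorldState`, `simulate_step`, and the action format are all hypothetical stand-ins for EgoSim's learned modules.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical world state: object name -> (x, y, z) position."""
    objects: dict = field(default_factory=dict)


def simulate_step(state: WorldState, action: dict):
    """Render an observation for the action, then persist its effect.

    Toy stand-in for the paper's observation-simulation and
    state-updating modules; the logic here is illustrative only.
    """
    # "Render" a placeholder observation from the current state.
    obs = f"view of {action['object']} during '{action['verb']}'"
    # Write the interaction's outcome back into the world state.
    if action["verb"] == "move":
        state.objects[action["object"]] = action["target"]
    return obs, state


# Closed-loop rollout: each step sees the state left by the previous one,
# so stacking a block twice accumulates instead of resetting.
state = WorldState(objects={"block": (0.0, 0.0, 0.0)})
actions = [
    {"verb": "move", "object": "block", "target": (0.0, 0.0, 0.1)},
    {"verb": "move", "object": "block", "target": (0.0, 0.0, 0.2)},
]
for act in actions:
    obs, state = simulate_step(state, act)

print(state.objects["block"])
```

The key design point mirrored here is that the scene is mutable shared state across interactions, rather than something re-derived from scratch for every generated clip.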
Why it matters?
This work is important because it creates a more realistic and useful simulation environment. This could be used to train robots to perform tasks in the real world, develop better virtual reality experiences, or even improve our understanding of how people interact with their surroundings. The ability to simulate complex, multi-step interactions is a big step forward, and the system’s ability to work with data from everyday cameras makes it more accessible.
Abstract
We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodied interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency enforced by an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty of acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from large-scale in-the-wild monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.