
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu

2026-03-18


Summary

This paper introduces Kinema4D, a new way to simulate how robots interact with the world using artificial intelligence. It focuses on creating realistic 4D (three dimensions of space plus time) simulations that allow for precise robot control and believable environmental reactions.

What's the problem?

Current robot simulation methods often fall short because they either operate in simplified 2D environments or rely on pre-set environmental responses. Real-world robot interactions are complex and happen over time – they’re 4D events – and require accurately modeling how the environment *reacts* to the robot’s actions. Existing simulators don't capture this dynamic, interactive element well, limiting their usefulness for training robots to operate in the real world.

What's the solution?

The researchers developed Kinema4D, which breaks the simulation into two parts. First, it drives a 3D robot model through its kinematic structure, specified in a URDF file (a standard robot description format), to produce a precise 4D trajectory of the robot's movements. Second, it projects this trajectory into a pointmap that conditions a generative model, which then synthesizes the environment's reactions as synchronized RGB video and pointmap (depth) sequences. They also curated a large dataset, Robo4D-200k, with over 200,000 robot interaction episodes to train the system.
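The first stage can be pictured with a toy example: given a robot's kinematic parameters, forward kinematics maps joint angles to end-effector positions, and sampling those angles over time yields a spatiotemporal (4D) trajectory. The sketch below uses a hypothetical 2-joint planar arm with made-up link lengths and joint schedules; the actual system parses full URDF kinematic chains in 3D, which this does not attempt to reproduce.

```python
import math

# Hypothetical sketch: forward kinematics for a 2-link planar arm,
# sampled over time to produce (x, y, t) trajectory points. Link
# lengths and the joint-angle schedule are assumed for illustration;
# Kinema4D drives a full URDF-based 3D robot, not this toy arm.

L1, L2 = 0.5, 0.3  # assumed link lengths (metres)

def fk(theta1, theta2):
    """End-effector position of a 2-link planar arm."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y

def trajectory(num_steps=5, dt=0.1):
    """Sample joint angles over time, recording (x, y, t) samples."""
    samples = []
    for k in range(num_steps):
        t = k * dt
        theta1 = 0.2 * t  # assumed joint velocity schedule
        theta2 = 0.5 * t
        x, y = fk(theta1, theta2)
        samples.append((x, y, t))
    return samples

traj = trajectory()
```

Because the trajectory comes from kinematics rather than a learned model, the robot's motion is exact by construction; only the environment's reaction is left to the generative stage.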

Why it matters?

Kinema4D is important because it is, per the authors, the first simulator to recreate the full 4D, interactive nature of robot-world interactions while keeping precise control over the robot itself. This enables more realistic training of robots in simulation and, notably, shows potential for robots trained in Kinema4D to perform in the real world without further training, a capability called 'zero-shot transfer'. This could significantly speed up the development and deployment of robots across applications.

Abstract

Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.
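The pointmap conditioning step in the abstract can be illustrated with a standard pinhole projection: each 3D point in the camera frame lands on a pixel, and that pixel stores the point's XYZ coordinates, giving an image-shaped spatiotemporal signal. The image size, intrinsics, and sample points below are all assumed values; the paper's pipeline projects the full robot geometry per frame, not a handful of sparse points.

```python
import numpy as np

# Hypothetical sketch of building a pointmap: an H x W x 3 image where
# each projected pixel stores the camera-frame XYZ of the point it sees.
# Intrinsics (fx, fy, cx, cy) and the image size are assumed for
# illustration, as is the use of sparse points instead of a rendered robot.

H, W = 4, 6
fx = fy = 2.0          # assumed focal lengths (pixels)
cx, cy = W / 2, H / 2  # principal point at the image centre

def project_to_pointmap(points):
    """points: (N, 3) array of camera-frame XYZ with Z > 0."""
    pointmap = np.zeros((H, W, 3))
    for X, Y, Z in points:
        u = int(fx * X / Z + cx)  # pinhole projection, truncated to pixel
        v = int(fy * Y / Z + cy)
        if 0 <= u < W and 0 <= v < H:
            pointmap[v, u] = (X, Y, Z)
    return pointmap

pts = np.array([[0.5, 0.0, 1.0], [-0.5, 0.25, 2.0]])
pm = project_to_pointmap(pts)
```

Storing XYZ per pixel (rather than only depth) lets a video generator consume the robot's geometry in the same spatial layout as the RGB frames it must synthesize.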