HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu
2026-03-17
Summary
This paper introduces HSImul3R, a system that reconstructs 3D models of people interacting with objects and environments from casual captures, such as everyday videos or a handful of images, and is specifically designed so the results remain usable inside physics simulations.
What's the problem?
Reconstructing 3D human-scene interactions from casual recordings is difficult because the resulting models often *look* convincing but are not physically consistent. When loaded into a physics engine, for example to simulate a robot picking up an object, they become unstable and the simulation fails. In short, there is a disconnect between how reconstructions look and how they behave under real physics.
What's the solution?
HSImul3R closes this gap with a bi-directional pipeline that uses a physics simulator as an active supervisor, constantly checking and refining the 3D model. In the forward direction, Scene-targeted Reinforcement Learning refines the reconstructed human motion so it stays faithful to the capture while maintaining stable contact with objects. In the reverse direction, Direct Simulation Reward Optimization feeds simulator outcomes, such as whether an object topples under gravity or whether a grasp succeeds, back into the scene, adjusting object shape and properties to make them physically accurate. The authors also introduce HSIBench, a new benchmark of diverse objects and interaction scenarios for evaluating how well the system works.
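To make the loop structure concrete, here is a minimal, purely illustrative sketch of a bi-directional refinement loop. Everything in it is an assumption for illustration: the `simulate` function is a toy stand-in for a real physics rollout, the scalar `motion_offset` and `scene_tilt` parameters stand in for full motion and geometry parameters, and random hill-climbing stands in for the paper's actual RL and reward-optimization machinery.

```python
# Illustrative sketch only: names, parameters, and the hill-climbing update
# are hypothetical stand-ins, not the paper's implementation. The point is
# the alternating structure: a forward pass refines human motion against a
# contact-stability reward, and a reverse pass refines scene geometry
# against a gravitational-stability reward reported by the simulator.

import random


def simulate(motion_offset, scene_tilt):
    """Toy physics check standing in for a simulator rollout.

    Returns (contact_stability, gravity_stability), each in [0, 1],
    higher meaning more stable. Here stability simply decays with the
    magnitude of the motion offset and the scene tilt."""
    contact = max(0.0, 1.0 - abs(motion_offset))
    gravity = max(0.0, 1.0 - abs(scene_tilt))
    return contact, gravity


def refine(motion_offset, scene_tilt, steps=200, step_size=0.05, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # Forward direction: perturb the motion and keep the change if the
        # simulator reports equal-or-better contact stability (a crude
        # stand-in for reinforcement-learning policy improvement).
        contact, _ = simulate(motion_offset, scene_tilt)
        trial = motion_offset + rng.uniform(-step_size, step_size)
        new_contact, _ = simulate(trial, scene_tilt)
        if new_contact >= contact:
            motion_offset = trial

        # Reverse direction: perturb the scene geometry and keep the change
        # if gravitational stability improves (standing in for simulation-
        # reward-driven geometry refinement).
        _, gravity = simulate(motion_offset, scene_tilt)
        trial_tilt = scene_tilt + rng.uniform(-step_size, step_size)
        _, new_gravity = simulate(motion_offset, trial_tilt)
        if new_gravity >= gravity:
            scene_tilt = trial_tilt
    return motion_offset, scene_tilt


# Starting from an unstable reconstruction, both quantities shrink toward
# a stable configuration as the loop alternates directions.
motion, tilt = refine(motion_offset=0.8, scene_tilt=0.6)
```

The design point the sketch captures is that neither pass alone suffices: a plausible motion on unstable geometry, or stable geometry under implausible motion, still fails in simulation, which is why the two refinements alternate against the same simulator.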
Why does it matter?
This research matters because it enables realistic 3D environments and human models that can be used directly to train robots and develop embodied AI, that is, AI that perceives and acts in the physical world. By bridging the gap between visual realism and physical accuracy, HSImul3R makes it possible to build simulations that are reliable enough to transfer to real-world robotics applications.
Abstract
We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.