CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma
2026-04-22
Summary
This paper introduces a new system called CoInteract that creates realistic videos of people interacting with objects, like someone using a product shown in an ad.
What's the problem?
Current video generation technology, while good at making things *look* real, struggles with two key things when it comes to people and objects. First, it often distorts sensitive details like hands and faces, making them look unnatural. Second, it doesn't always produce interactions that make physical sense: hands will sometimes pass *through* objects, which clearly isn't realistic.
What's the solution?
CoInteract tackles these problems with a two-part approach built on a powerful existing video generation model. First, it uses a 'Human-Aware Mixture-of-Experts', which is like having specialized mini-experts that each focus on a specific part of the person, such as the hands or face, to make sure those areas are rendered correctly. Second, it uses 'Spatially-Structured Co-Generation', which trains the system to understand *how* people and objects should interact geometrically, so that things don't clip through each other and the movements look natural. The system learns this interaction information during training but doesn't need it during actual video creation, keeping generation efficient.
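To make the Mixture-of-Experts idea concrete, here is a minimal NumPy sketch of soft token routing among a few region-specialized experts, with a spatially supervised routing loss that pushes each token toward the expert assigned to its body region. All names, shapes, and the use of plain linear experts are illustrative assumptions for this sketch, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def human_aware_moe(tokens, router_w, expert_ws):
    """Softly route each token among region-specialized experts.

    tokens:    (S, D) token features
    router_w:  (D, E) learned router weights
    expert_ws: list of E (D, D) expert weights (real experts would be
               small MLPs inside a DiT block; linear maps are a stand-in)
    """
    gates = softmax(tokens @ router_w)                          # (S, E) routing weights
    outs = np.stack([tokens @ w for w in expert_ws], axis=-1)   # (S, D, E) expert outputs
    mixed = (outs * gates[:, None, :]).sum(axis=-1)             # gate-weighted mixture
    return tokens + mixed, gates  # residual output; gates reused for the routing loss

def routing_loss(gates, region_ids):
    """Spatial supervision: cross-entropy nudging each token's gate
    toward the expert for its region (e.g. 0=hands, 1=face, 2=other)."""
    picked = gates[np.arange(len(region_ids)), region_ids]
    return -np.log(picked + 1e-9).mean()
```

The routing loss is what makes the experts "human-aware": without it, a plain MoE has no reason to dedicate an expert to hands or faces.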
Why it matters?
This research is important because creating realistic videos of people interacting with products is crucial for things like online shopping, advertising, and virtual demonstrations. Better video generation means more engaging and believable content, which can lead to increased sales and a better user experience.
Abstract
Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
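The abstract's dual-stream design, where the HOI structure stream attends to RGB tokens during training but is dropped at inference, can be sketched as a single shared block with one-directional cross-attention. This is a simplified illustration under assumed shapes and a single attention head, not the paper's actual implementation.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention, no masking."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def co_generation_step(rgb_tokens, hoi_tokens=None):
    """One shared-backbone step sketching the co-generation idea.

    During training the auxiliary HOI structure stream cross-attends to
    RGB tokens, so its supervision flows into the shared weights. At
    inference hoi_tokens is None and the step reduces to plain RGB
    self-attention, i.e. the HOI branch adds zero inference cost.
    """
    rgb_out = attention(rgb_tokens, rgb_tokens, rgb_tokens)   # RGB self-attention
    if hoi_tokens is None:
        return rgb_out, None                                  # inference path
    hoi_out = attention(hoi_tokens, rgb_tokens, rgb_tokens)   # HOI stream reads RGB
    return rgb_out, hoi_out  # both streams supervised during training
```

Because the attention is one-directional (HOI reads RGB, never the reverse), the RGB output is identical with or without the HOI stream attached, which is what allows the branch to be removed at inference for free.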