SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu

2025-11-25

Summary

This paper introduces a new method, SyncMV4D, for creating realistic videos of hands interacting with objects. It aims to generate these videos from multiple viewpoints, along with accurate 3D motion data.

What's the problem?

Existing approaches have notable limitations. Most video-based methods rely on a single camera angle, which makes it hard to capture 3D shapes and movements accurately, leading to geometric distortions or unrealistic motion. Methods that *do* use 3D data depend on high-quality captures from controlled lab environments, so they generalize poorly to real-world videos taken in everyday situations.

What's the solution?

SyncMV4D solves this by combining information from multiple camera views within a single generative process. A Multi-view Joint Diffusion (MJD) model uses a diffusion model to co-generate the multi-view videos together with a rough, intermediate version of the 3D motion. A second component, the Diffusion Points Aligner (DPA), then refines that coarse motion into accurate 4D point tracks that are consistent across all camera views. Importantly, the video and the motion constantly influence each other during generation: at each denoising step, the generated video conditions the motion refinement, and the aligned motion is projected back into each view to guide the next generation step, creating a feedback loop.
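To make the feedback loop concrete, here is a minimal conceptual sketch of the closed-loop denoising cycle the paper describes. All function names, shapes, and update rules are illustrative stand-ins, not the authors' actual implementation; only the control flow (joint denoising, motion alignment conditioned on video, reprojection as guidance for the next step) reflects the described method.

```python
# Conceptual sketch of SyncMV4D's closed-loop denoising cycle.
# Every function body below is a toy placeholder; the real MJD and DPA
# are learned diffusion networks.
import numpy as np

rng = np.random.default_rng(0)

def mjd_denoise_step(video, motion, guidance, t):
    """Stand-in for the Multi-view Joint Diffusion (MJD) step: jointly
    denoises multi-view frames and coarse motion, conditioned on the
    reprojected 4D guidance from the previous step."""
    video = video - 0.1 * (video - guidance)   # toy update toward guidance
    motion = motion * (1.0 - 0.1 * t)          # toy coarse-motion update
    return video, motion

def dpa_refine(motion, video):
    """Stand-in for the Diffusion Points Aligner (DPA): refines coarse
    intermediate motion into globally aligned 4D point tracks, using the
    current video as a condition."""
    return 0.5 * (motion + video.mean())       # toy "alignment"

def reproject(tracks):
    """Stand-in: reproject aligned 4D point tracks back into each view
    to guide the next joint-generation step."""
    return np.full_like(tracks, tracks.mean())

def closed_loop_sampling(n_views=4, n_steps=5):
    video = rng.standard_normal((n_views, 8, 8))   # noisy multi-view frames
    motion = rng.standard_normal((n_views, 8, 8))  # noisy coarse motion
    guidance = np.zeros_like(video)
    tracks = motion
    for t in np.linspace(1.0, 0.0, n_steps):
        video, motion = mjd_denoise_step(video, motion, guidance, t)
        tracks = dpa_refine(motion, video)   # video conditions the motion
        guidance = reproject(tracks)         # motion guides the next step
    return video, tracks

video, tracks = closed_loop_sampling()
print(video.shape, tracks.shape)
```

The point of the sketch is the data flow: neither stream is generated first and then the other; each denoising iteration passes information in both directions.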

Why it matters?

This research is important because realistic hand-object interaction videos are crucial for things like creating believable animations and training robots to interact with the world. By creating a method that works well with real-world videos and generates accurate 3D data, SyncMV4D opens up possibilities for more advanced and practical applications in these fields.

Abstract

Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.