Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, Varun Jampani
2025-09-17
Summary
This paper introduces a new method called Stable Part Diffusion 4D, or SP4D, which creates realistic videos of objects showing both their appearance and how their parts move. It takes a single video as input and generates a matching pair of videos: one showing the object's colors and one showing how its individual parts move together, like a skeleton beneath skin.
What's the problem?
Existing methods for identifying object parts in videos usually focus on how things *look*, which can be unreliable because lighting and appearance change. They don't capture how the parts are actually connected and move together, which makes their part segmentations hard to use for tasks like animation or controlling virtual objects.
What's the solution?
SP4D uses a diffusion model, a type of generative AI, designed here with two branches working together: one generates the color video and the other produces a map of the object's parts. To keep things simple, the paper uses a clever trick of representing each part as a color, so both branches can share the same image encoder and learn from each other. The authors also added a fusion module and a consistency loss to make sure the parts stay consistent over time and across different views. Finally, they created a large dataset of over 20,000 rigged 3D objects with labeled parts to train and test the system.
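The color-encoding trick can be illustrated with a small sketch: each part ID is assigned a distinct RGB color so the segmentation map looks like an ordinary image, and part labels are recovered afterwards by matching each pixel to its nearest palette color. This is a minimal illustration of the idea, not the paper's implementation; the function names and the random palette are assumptions.

```python
import numpy as np

def make_palette(num_parts, seed=0):
    """Assign each part a distinct RGB color in [0, 1] (illustrative choice)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=(num_parts, 3))

def encode_parts(part_ids, palette):
    """Turn an (H, W) integer part map into an (H, W, 3) RGB-like image."""
    return palette[part_ids]

def decode_parts(rgb, palette):
    """Recover part IDs by nearest-color lookup (simple post-processing)."""
    # Distance from every pixel to every palette color: shape (H, W, P)
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None], axis=-1)
    return dists.argmin(axis=-1)

palette = make_palette(num_parts=4)
parts = np.array([[0, 1], [2, 3]])
rgb = encode_parts(parts, palette)                   # RGB-like segmentation image
assert (decode_parts(rgb, palette) == parts).all()   # round-trips exactly
```

Because the encoded map is just an image, it can pass through the same VAE as the RGB frames, and the nearest-color decoding tolerates the small color drift that diffusion sampling introduces.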
Why it matters?
This research is important because it allows computers to better understand how objects are structured and how their parts move. This is crucial for creating more realistic animations, controlling robots, and developing virtual reality experiences where objects behave in a natural way. The ability to generate this information from a single video makes it much more practical for real-world applications.
Abstract
We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
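The abstract mentions a contrastive part consistency loss that aligns part predictions across space and time. The paper's exact formulation is not given here, but a generic version of such a loss pulls together feature vectors of pixels belonging to the same part (across frames and views) and pushes apart those of different parts. Below is a minimal InfoNCE-style sketch under that assumption; the function name, sampling scheme, and temperature are hypothetical.

```python
import numpy as np

def contrastive_part_loss(feats, part_ids, temperature=0.1):
    """
    Generic InfoNCE-style consistency loss (illustrative, not the paper's).
    feats:    (N, D) L2-normalized features sampled across frames/views.
    part_ids: (N,) part label per feature.
    Assumes every sampled part appears at least twice, so each
    feature has at least one positive pair.
    """
    sim = feats @ feats.T / temperature            # (N, N) pairwise similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude self-pairs
    same = part_ids[:, None] == part_ids[None, :]  # positive-pair mask
    np.fill_diagonal(same, False)
    # Row-wise log-softmax over all other features
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Negative mean log-probability of the positive pairs
    return -logp[same].sum() / same.sum()
```

When same-part features are nearly identical and different-part features are dissimilar, the loss approaches zero; mixed-up features drive it up, which is the gradient signal that encourages spatially and temporally stable part assignments.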