SS4D: Native 4D Generative Model via Structured Spacetime Latents
Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin
2025-12-17
Summary
This paper introduces a new way to create realistic, moving 3D objects from just a single video. The method, called SS4D, directly generates these 4D (3D plus time) models, rather than assembling them afterwards from separate 3D or video generative models.
What's the problem?
Creating 4D models of objects – meaning 3D objects that change over time – is really hard. Existing methods either try to piece together 3D shapes frame by frame, or they work with videos but struggle to produce motion that stays consistent and realistic. On top of that, 4D training data is scarce, so getting enough examples to train these models is itself a challenge.
What's the solution?
The researchers tackled this by building a system that learns directly from 4D data, but they cleverly overcame the data shortage. They started with a model that's good at creating 3D objects from single images, which gives the system a strong base for spatial consistency. Then, they added special layers that focus on making the movement look natural and smooth over time. To handle long videos efficiently, they compressed the information representing the movement, making the process faster and less demanding on computers. They also used a training method that helps the model deal with parts of the object being hidden from view.
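The pipeline above can be illustrated with a minimal sketch: a frozen "spatial" layer (standing in for the pre-trained image-to-3D model) is applied to each frame independently, and a new "temporal" layer mixes information across frames. All shapes, names, and operations here are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, C = 8, 16, 4          # frames, latent tokens per frame, channels
latents = rng.normal(size=(T, N, C))

# Frozen weights, standing in for the pre-trained single-image-to-3D model.
W_spatial = rng.normal(size=(C, C))

def spatial_layer(x):
    # Applied independently per frame, so spatial consistency is inherited
    # from the pre-trained model.
    return x @ W_spatial

def temporal_layer(x):
    # A simple moving average over the frame axis stands in for the
    # dedicated temporal layers that reason across frames.
    out = x.copy()
    out[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return out

h = temporal_layer(spatial_layer(latents))
print(h.shape)  # (8, 16, 4): per-frame structure kept, frames now coupled
```

The key design point this sketch captures is that only the temporal layers are new (and trainable), so the scarce 4D data is spent on learning motion, not on relearning 3D structure.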
Why it matters?
This work is important because it offers a more direct and effective way to generate dynamic 3D objects. This could be useful in a lot of areas, like creating realistic characters for video games, generating training data for robots, or even making special effects for movies. By directly learning from 4D data and addressing the challenges of data scarcity and computational cost, SS4D represents a significant step forward in 4D computer graphics.
Abstract
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically: (1) to address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency; (2) temporal consistency is enforced by introducing dedicated temporal layers that reason across frames; (3) to support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion.
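The compression step in point (3) can be sketched as follows: instead of one expensive 4D kernel over (time, x, y, z), a factorized convolution applies a spatial pass per frame and a temporal pass per voxel, and the temporal pass strides to downsample the frame axis. All kernel sizes and weights below are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
T, X, Y, Z = 8, 4, 4, 4
vol = rng.normal(size=(T, X, Y, Z))   # a dense spacetime latent volume

def spatial_pass(v):
    # Stand-in for a 3D conv applied per frame: local average along x
    # (the same idea extends to y and z).
    out = v.copy()
    out[:, 1:-1] = (v[:, :-2] + v[:, 1:-1] + v[:, 2:]) / 3.0
    return out

def temporal_downsample(v, kernel=(0.25, 0.5, 0.25), stride=2):
    # 1D temporal conv per voxel, then stride-2 subsampling, which
    # shortens the latent sequence along the temporal axis.
    k0, k1, k2 = kernel
    mixed = k0 * v[:-2] + k1 * v[1:-1] + k2 * v[2:]   # "valid" temporal conv
    return mixed[::stride]

compressed = temporal_downsample(spatial_pass(vol))
print(compressed.shape)  # (3, 4, 4, 4): temporal axis shortened
```

Factorizing the 4D kernel this way replaces one O(k^4) operation with two cheaper passes, which is what makes training and inference over long sequences tractable.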