Inferring Compositional 4D Scenes without Ever Seeing One

Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel

2025-12-16

Summary

This paper introduces a new method, COM4D, for creating 4D models (3D shape plus time) of scenes from regular 2D videos. It aims to understand how objects move and interact in a scene over time, building a complete picture of both their shape and their motion.

What's the problem?

Currently, building 4D models of real-world scenes is really difficult. Most existing methods track just *one* object at a time, and they often rely on pre-defined, category-specific shape models for those objects. This means they can't handle scenes with many different objects, or objects that don't fit those pre-defined shapes, leading to incomplete or inaccurate reconstructions. Essentially, it's hard to get a consistent, comprehensive understanding of a dynamic scene.

What's the solution?

COM4D tackles this by learning to pay attention to both the spatial arrangement of objects *and* how they change over time, all from standard 2D video. Training is split into two parts: one learns how objects are composed together in a scene (from static multi-object data), and the other learns how a single object moves throughout a video. At inference time, it mixes these two independently learned attentions, so it can reconstruct an entire 4D scene without ever seeing an example of a scene with multiple moving objects during training. By alternating between spatial reasoning (where things are) and temporal reasoning (how they're changing), it builds a complete and persistent model of the scene.
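To make the alternation idea concrete, here is a minimal sketch of interleaving spatial attention (across objects within a frame) with temporal attention (across frames within an object). This is an illustration of the general pattern only, not the paper's actual architecture: the token shapes, the `mixed_pass` function, and the use of plain self-attention without learned weights are all assumptions for clarity.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product self-attention over a set of token vectors.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def mixed_pass(tokens, n_steps=2):
    # tokens: array of shape (objects, frames, dim).
    # Alternate spatial attention (mixing information across objects,
    # one frame at a time) with temporal attention (mixing information
    # across frames, one object at a time).
    n_objects, n_frames, _ = tokens.shape
    x = tokens.copy()
    for _ in range(n_steps):
        for t in range(n_frames):   # spatial: across objects in frame t
            x[:, t] = attention(x[:, t], x[:, t], x[:, t])
        for o in range(n_objects):  # temporal: across frames of object o
            x[o] = attention(x[o], x[o], x[o])
    return x

# Example: 3 objects, 5 frames, 8-dim tokens.
out = mixed_pass(np.random.default_rng(0).normal(size=(3, 5, 8)))
```

The key point of the sketch is that the two attention directions never need to be trained jointly: each pass operates on a different slice of the token grid, which mirrors how the method can learn composition and dynamics from separate data sources.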

Why it matters?

This work is important because it allows for the creation of detailed 4D scene models from just regular videos, without needing special equipment or pre-defined object shapes. It's a step towards computers being able to truly understand and interpret the dynamic world around us, and it even improves performance on existing tasks like reconstructing individual objects or 3D scenes, showing its broad applicability.

Abstract

Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.