ShapeGen4D: Towards High Quality 4D Shape Generation from Videos
Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A. Yeh, Peter Wonka, Chaoyang Wang
2025-10-08
Summary
This research focuses on creating a 3D model of something that changes over time, directly from a video. Imagine taking a video of a person dancing and automatically building a 3D model that reproduces their movements over time; that is the goal of 4D shape generation.
What's the problem?
Currently, it is difficult to automatically generate accurate and stable 3D models from videos, especially when objects move in complex ways, change shape, or have parts that appear and disappear. Existing methods often struggle to keep the 3D model consistent from one video frame to the next, and they can fail outright on challenging inputs.
What's the solution?
The researchers developed a new system built on large-scale pre-trained 3D generation models. It attends to all video frames at once to capture the timing of movements, and it samples points in a time-aware way and anchors them to a shared 4D latent so that geometry and texture stay consistent over time. A key trick is sharing the same starting 'noise' across frames, which smooths the animation and prevents jitter. The result is a single, dynamic 3D representation generated directly from the video, with no need to optimize each frame individually.
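To make the noise-sharing idea concrete, here is a minimal sketch of how a single shared noise tensor could seed a frame-conditioned diffusion-style sampler. The function name, tensor shapes, and the `denoise_step` callable are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the "shared noise" idea, assuming a diffusion-style
# 3D latent generator. All names and shapes here are hypothetical.
import torch

def sample_shared_noise_latents(denoise_step, video_features, num_frames=16,
                                latent_shape=(4096, 64), num_steps=50):
    """Denoise all frames starting from one shared noise tensor.

    denoise_step(latents, t, cond) -> latents is a hypothetical callable
    wrapping a pre-trained 3D latent generator conditioned on video features.
    """
    # One noise sample, broadcast to every frame: all frames start from the
    # same latent, so per-frame randomness cannot introduce jitter.
    shared_noise = torch.randn(1, *latent_shape)
    latents = shared_noise.repeat(num_frames, 1, 1)  # (T, N, C)

    for t in reversed(range(num_steps)):
        # Each frame is still conditioned on its own video features,
        # so motion comes from the conditioning, not from the noise.
        latents = denoise_step(latents, t, video_features)
    return latents

# Toy usage with an identity "denoiser", just to show the call pattern.
latents = sample_shared_noise_latents(lambda z, t, c: z, video_features=None)
```

The point of the sketch is only the broadcasting step: because every frame is initialized from the same noise, frame-to-frame differences must come from the video conditioning rather than from independent random draws.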
Why it matters?
This work is important because it improves the quality and reliability of creating 3D models from videos. This has many potential applications, like creating realistic avatars for virtual reality, generating 3D content for movies and games, or even analyzing movements in medical videos. By making the process more robust and accurate, it opens up possibilities for more widespread use of video-based 3D reconstruction.
Abstract
Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention mechanism that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) a time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
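As a rough illustration of component (i), the sketch below shows one way attention over the time axis of per-frame latent tokens could be wired up in PyTorch. The module name, token shapes, and hyperparameters are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch (not the authors' code) of temporal attention over
# per-frame shape tokens: tokens at the same spatial index attend across
# all frames, yielding a time-indexed latent for the dynamic shape.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim=64, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (T, N, C) = (frames, latent points, channels)
        T, N, C = tokens.shape
        # Regroup so each latent point attends over the time axis.
        x = tokens.permute(1, 0, 2)             # (N, T, C)
        h = self.norm(x)
        y, _ = self.attn(h, h, h)               # attention across frames
        x = x + y                                # residual update
        return x.permute(1, 0, 2)                # back to (T, N, C)

# Usage: 16 frames, 4096 latent points, 64 channels per token.
latents = torch.randn(16, 4096, 64)
out = TemporalAttention()(latents)              # same shape, time-mixed
```

Attending only along the time axis keeps the cost linear in the number of latent points while still letting every frame's tokens see the whole video, which is the property the abstract attributes to its temporal conditioning.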