Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image
Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
2025-12-08
Summary
This paper focuses on creating realistic, moving 3D scenes (often called 4D scenes: 3D space plus time) from just a single still image, which is a really hard problem in computer graphics and artificial intelligence.
What's the problem?
Currently, most methods for doing this either first create the 3D shape and *then* add motion, or vice versa. This separation often leads to scenes that don't look quite right – things might move in unnatural ways or the 3D shape might change unexpectedly when motion is added. Also, there wasn't a lot of good data available to train these kinds of systems, making it hard to get realistic results.
What's the solution?
The researchers developed a system called MoRe4D, short for Motion generation and geometric Reconstruction for 4D Synthesis. They tackled the data problem by building a new large-scale dataset called TrajScene-60K, containing 60,000 videos annotated with dense point trajectories – the paths traced by many individual points over time. Then, they used a technique called diffusion modeling to generate both the 3D structure and the motion *at the same time*, ensuring they work together seamlessly. They also added a depth-guided motion normalization strategy and a motion-aware module to make sure the motion looks natural and the 3D shape stays consistent as things move. Finally, they built a module to render videos from any camera angle based on the generated 3D scene and motion.
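To make the idea of "depth-guided motion normalization" concrete, here is a toy sketch of one plausible interpretation: because perspective makes far-away points appear to move less in the image, scaling each point's displacement by its estimated depth puts near and far motion on a comparable scale before a generator sees it. This is an illustration only – the function name, shapes, and the exact normalization rule are assumptions, not the paper's actual implementation.

```python
import numpy as np

def normalize_motion_by_depth(tracks, depth):
    """Toy depth-guided motion normalization (hypothetical sketch).

    tracks: (T, N, 2) image-space trajectories of N points over T frames.
    depth:  (N,) per-point depth, e.g. from a monocular depth estimator.

    Apparent image motion shrinks roughly as 1/depth, so multiplying
    frame-to-frame displacements by depth compensates for perspective.
    """
    disp = np.diff(tracks, axis=0)            # (T-1, N, 2) per-frame motion
    disp = disp * depth[None, :, None]        # undo perspective shrinkage
    # Rebuild trajectories from normalized displacements, anchoring frame 0.
    rest = tracks[:1] + np.cumsum(disp, axis=0)
    return np.concatenate([tracks[:1], rest], axis=0)
```

Under this sketch, a point at depth 4 moving 1 pixel per frame ends up with the same normalized motion as a point at depth 1 moving 4 pixels per frame.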
Why it matters?
This work is important because it allows for the creation of more realistic and immersive virtual experiences from a single image. Imagine being able to take a picture of a park and then virtually walk around in it, watching people move and trees sway, all generated from that one photo. This has potential applications in areas like virtual reality, movie making, and robotics.
Abstract
Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.
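The 4D-ViSM described above renders videos along arbitrary camera trajectories from 4D point tracks. The learned module itself is not detailed here, but its geometric core is standard pinhole projection of time-varying 3D points through a moving camera. The sketch below shows only that geometry; the function name and array conventions are assumptions, and a real renderer would additionally handle occlusion and appearance.

```python
import numpy as np

def project_tracks(points_4d, K, poses):
    """Project time-varying 3D points into a moving pinhole camera.

    points_4d: (T, N, 3) world-space point positions per frame
               (a "4D point track": 3D geometry evolving over time).
    K:         (3, 3) camera intrinsics.
    poses:     (T, 4, 4) world-to-camera extrinsics, one per frame.

    Returns (T, N, 2) pixel coordinates per frame.
    """
    T, N, _ = points_4d.shape
    homog = np.concatenate([points_4d, np.ones((T, N, 1))], axis=-1)  # (T, N, 4)
    cam = np.einsum('tij,tnj->tni', poses, homog)[..., :3]            # camera frame
    pix = np.einsum('ij,tnj->tni', K, cam)                            # intrinsics
    return pix[..., :2] / pix[..., 2:3]                               # perspective divide
```

Feeding in a different `poses` sequence re-renders the same generated 4D scene from a new camera trajectory, which is the multi-view consistency the abstract refers to.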