
One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control

Zhenxing Mi, Yuxin Wang, Dan Xu

2025-11-25


Summary

This paper introduces One4D, a system that can both generate and reconstruct moving 3D scenes from images and videos, producing realistic color frames and detailed 3D point cloud data (pointmaps) at the same time.

What's the problem?

Existing methods for reconstructing 3D content from video often struggle when a complete video isn't available – for example, when you only have a single image or a very sparse set of frames. Also, the fine-tuning techniques that work well for producing depth maps or point clouds on their own break down when you want to generate realistic color *and* accurate 3D simultaneously, often degrading the quality of the underlying video generation model.

What's the solution?

The researchers developed a system called One4D that uses a technique called 'Decoupled LoRA Control'. This creates separate processing pathways for the color frames and the 3D pointmaps, allowing each to be refined independently, while lightweight connections between the two pathways – which start out at zero and grow during training – let them gradually learn to stay consistent, so the 2D images and the 3D shapes match each other. They also use a 'Unified Masked Conditioning' mechanism, which lets one model handle everything from a single input image to a full video to a sparse scattering of frames.
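To make the masking idea concrete, here is a minimal numpy sketch of how a unified masked-conditioning input could be built. This is an illustrative toy, not the paper's implementation: frames are stand-in feature vectors, and the function name `umc_inputs` is hypothetical. The one shared mechanism covers generation (one known frame), reconstruction (all frames known), and the mixed sparse case.

```python
import numpy as np

def umc_inputs(frames, known_idx):
    """Toy Unified Masked Conditioning input builder (illustrative).

    frames: (T, C) array of per-frame features (stand-ins for video frames);
    known_idx: indices of frames that are actually observed.
    Unobserved frames are zeroed out, and a per-frame mask channel tells
    the model which frames are real conditions versus frames to generate.
    """
    T = frames.shape[0]
    mask = np.zeros((T, 1))
    mask[list(known_idx)] = 1.0
    cond = frames * mask  # hide the frames the model must generate
    return np.concatenate([cond, mask], axis=1)

T, C = 6, 3
frames = np.arange(T * C, dtype=float).reshape(T, C)

gen = umc_inputs(frames, [0])           # generation: only first frame given
recon = umc_inputs(frames, range(T))    # reconstruction: full video given
sparse = umc_inputs(frames, [0, 3, 5])  # mixed: sparse frames given
```

The same function covers all three regimes, which is the point of the "unified" conditioning: the model never needs separate input formats for generation versus reconstruction.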

Why it matters?

This work is a step towards building more complete and realistic digital representations of the world using video. Being able to generate and reconstruct 4D (3D over time) content from limited data has applications in areas like virtual reality, robotics, and creating special effects for movies, and this research provides a more robust and high-quality way to do that.

Abstract

We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D
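The Decoupled LoRA Control idea in the abstract – two modality-specific LoRA adapters plus zero-initialized control links over a frozen base – can be sketched in a few lines of numpy. This is a simplified toy under stated assumptions: the frozen video-model block is replaced by an identity function, the matrix names (`A_rgb`, `C_pts_to_rgb`, etc.) are illustrative, and real LoRA adapters sit inside attention/MLP layers rather than acting on raw features. The key property shown is that with the standard zero initialization (LoRA's B matrix and both control links at zero), the network starts out exactly equal to the frozen base model, so training can't immediately degrade it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # feature size and LoRA rank (toy values)

def make_lora(d, r, rng):
    # Standard LoRA init: A random, B zero, so the adapter starts as a no-op.
    return rng.normal(size=(d, r)) * 0.1, np.zeros((r, d))

A_rgb, B_rgb = make_lora(d, r, rng)  # RGB-branch adapter
A_pts, B_pts = make_lora(d, r, rng)  # pointmap-branch adapter

# Zero-initialized control links: at step 0 each branch ignores the other,
# and cross-modal consistency is learned gradually during training.
C_rgb_to_pts = np.zeros((d, d))
C_pts_to_rgb = np.zeros((d, d))

def frozen_base(x):
    # Stand-in for the frozen video-diffusion block (identity here).
    return x

def dlc_block(h_rgb, h_pts):
    out_rgb = frozen_base(h_rgb) + h_rgb @ A_rgb @ B_rgb + h_pts @ C_pts_to_rgb
    out_pts = frozen_base(h_pts) + h_pts @ A_pts @ B_pts + h_rgb @ C_rgb_to_pts
    return out_rgb, out_pts

h_rgb = rng.normal(size=(4, d))  # 4 tokens of RGB features
h_pts = rng.normal(size=(4, d))  # 4 tokens of pointmap features
o_rgb, o_pts = dlc_block(h_rgb, h_pts)
```

Because `B_rgb`, `B_pts`, and both control links start at zero, `o_rgb` and `o_pts` initially equal the frozen base outputs; only as these matrices move away from zero do the branches specialize and begin exchanging information.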