WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang

2025-09-19

Summary

This paper introduces a new method called WorldForge that gives pretrained video diffusion models precise control over motion and camera trajectories while the video is being generated, unlocking 3D/4D content creation without any retraining of the model.

What's the problem?

Current AI video generators are really good at understanding what things *look* like, but they struggle to make those things move in a precise and geometrically consistent way. If you want to control the motion, you usually have to retrain or fine-tune the model, which takes a lot of time and computing power, and can even make the model forget what it already learned about how the world looks.

What's the solution?

WorldForge solves this by working *during* the video creation process, not before or after. It has three tightly coupled parts: Intra-Step Recursive Refinement repeatedly refines the model's prediction within each denoising step so that the requested trajectory is followed precisely; Flow-Gated Latent Fusion separates how things move from how they look, so guidance only touches the motion-related parts of the latent and leaves appearance alone; and Dual-Path Self-Corrective Guidance compares the guided result against an unguided one to catch and correct any drift. All of this happens at inference time, with no extra training, as the sketch below illustrates.
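
To make the inference-time idea concrete, here is a minimal Python sketch of what such a training-free guided denoising loop could look like. Everything here is illustrative: `denoise_step` and `inject_trajectory` are toy stand-ins for the pretrained video diffusion model and the paper's trajectory injection, and the inner-loop structure is an assumption about how intra-step recursive refinement might be organized, not the authors' code.

```python
# Minimal sketch of training-free, inference-time guidance (illustrative only).
import torch

def denoise_step(latent: torch.Tensor, t: int) -> torch.Tensor:
    # Stand-in for one prediction from a pretrained video diffusion model.
    return latent - 0.01 * latent  # toy update, not a real model call

def inject_trajectory(pred: torch.Tensor, target: torch.Tensor, weight: float = 0.5) -> torch.Tensor:
    # Pull the prediction toward a trajectory-aligned target (e.g., a warped guide video).
    return pred + weight * (target - pred)

def guided_sampling(latent: torch.Tensor, trajectory_target: torch.Tensor,
                    num_steps: int = 50, num_inner_iters: int = 3) -> torch.Tensor:
    """Refine the prediction *within* each denoising step instead of retraining."""
    for t in range(num_steps):
        pred = denoise_step(latent, t)
        # Intra-step recursive refinement: repeatedly re-optimize the prediction
        # toward the injected trajectory before committing to the next step.
        for _ in range(num_inner_iters):
            pred = inject_trajectory(pred, trajectory_target)
            pred = denoise_step(pred, t)
        latent = pred
    return latent
```

The key design point is that all of the control happens inside the sampling loop, so the pretrained weights, and the world knowledge stored in them, are never modified.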

Why it matters?

This work is important because it offers a way to control the motion in AI-generated videos without the huge cost of retraining. It's like adding a steering wheel to a self-driving car: you can still let the AI do its thing, but you can take control when you need to. That opens up new possibilities for using AI to create realistic and controllable 3D and 4D content.

Abstract

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
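
For readers who prefer code, the sketch below illustrates how the other two modules could be expressed. The latent shape (batch, channels, time, height, width), the per-channel `flow_similarity` score, and the simple clamping rule are assumptions used to convey the idea of Flow-Gated Latent Fusion and Dual-Path Self-Corrective Guidance; they are not the paper's implementation.

```python
# Illustrative sketch of flow-gated fusion and dual-path correction (assumed shapes/names).
import torch

def flow_gated_fusion(pred_latent: torch.Tensor, guide_latent: torch.Tensor,
                      flow_similarity: torch.Tensor, gate_threshold: float = 0.5) -> torch.Tensor:
    """Inject trajectory guidance only into channels that behave like motion.

    flow_similarity: per-channel score in [0, 1] measuring how strongly a latent
    channel correlates with optical flow (motion) rather than appearance.
    """
    gate = (flow_similarity > gate_threshold).float().view(1, -1, 1, 1, 1)
    # Motion-like channels take the guided value; appearance channels stay untouched.
    return gate * guide_latent + (1.0 - gate) * pred_latent

def dual_path_correction(guided_pred: torch.Tensor, unguided_pred: torch.Tensor,
                         max_drift: float = 0.1) -> torch.Tensor:
    """Compare guided and unguided denoising paths and limit their disagreement,
    so noisy or misaligned structural signals cannot drag the sample too far off."""
    drift = (guided_pred - unguided_pred).clamp(-max_drift, max_drift)
    return unguided_pred + drift
```

In this reading, the unguided path acts as a reference for what the pretrained model would do on its own, and guidance is only allowed to deviate from it within a bounded margin.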