RealMaster: Lifting Rendered Scenes into Photorealistic Video

Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar

2026-03-25

Summary

This paper introduces RealMaster, a technique for making videos rendered with 3D engines look photorealistic, as if they were filmed in the real world.

What's the problem?

Currently, there's a trade-off when creating videos. You can have very realistic videos that are hard to control precisely, or perfectly controlled videos from 3D engines that often look artificial and 'uncanny'. The challenge is getting the best of both worlds: precise control *and* photorealistic visuals. Existing methods struggle to preserve the exact structure and motion of the 3D scene while also making it look truly real.

What's the solution?

RealMaster uses a type of artificial intelligence called a video diffusion model to 'upgrade' videos rendered from 3D engines. It's trained on a special paired dataset: the first and last frames of each video (the 'anchors') are made realistic, and that realism is then smoothly propagated to all the frames in between, guided by the 3D scene's geometry. The authors then train an IC-LoRA on these paired videos to make the model more flexible, so it can handle new objects and characters appearing mid-video and can run at inference time without needing anchor frames at all.
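To make the anchor idea concrete, here is a deliberately tiny, self-contained sketch of anchor-based propagation. It is not the paper's pipeline: RealMaster uses a learned image enhancer on the anchor frames and a geometry-guided diffusion model for propagation, whereas this toy version fakes the enhancer with a contrast boost and replaces the diffusion model with a simple linear blend of the two anchors' "enhancement residuals". All function names here are our own illustrative choices.

```python
def enhance(frame):
    # Stand-in for the realism enhancer applied to the anchor frames.
    # (The real pipeline uses a learned image editing model; a contrast
    # boost keeps this sketch runnable.)
    return [min(max(p * 1.2, 0.0), 1.0) for p in frame]

def anchor_propagate(frames):
    """Toy anchor-based propagation: enhance only the first and last
    frames, then spread their enhancement residuals linearly across
    the intermediate frames. The paper does this step with a video
    diffusion model conditioned on the 3D scene's geometry."""
    first, last = enhance(frames[0]), enhance(frames[-1])
    # What the enhancer changed at each anchor, per pixel.
    res_first = [a - b for a, b in zip(first, frames[0])]
    res_last = [a - b for a, b in zip(last, frames[-1])]
    T = len(frames) - 1
    out = []
    for t, frame in enumerate(frames):
        w = t / T  # interpolation weight between the two anchors
        out.append([p + (1 - w) * rf + w * rl
                    for p, rf, rl in zip(frame, res_first, res_last)])
    return out

# Five tiny four-pixel "frames" with steadily increasing brightness.
video = [[0.1 * (i + 1)] * 4 for i in range(5)]
realistic = anchor_propagate(video)
```

The key property of the sketch matches the paper's intent: the endpoints exactly equal the enhanced anchors, and the intermediate frames receive a smooth mixture of both, so the edit never contradicts the underlying rendered motion.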

Why it matters?

This work is important because it helps bridge the gap between the virtual world of 3D engines and the real world. This means game developers, filmmakers, and anyone creating visual content can have more control over their creations while still achieving a high level of realism, potentially saving time and resources compared to traditional methods.

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.