
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang

2025-12-15


Summary

This paper introduces a new system called V-RGBX that can create and edit videos in a more realistic and controllable way by working with the underlying properties of the scene, such as surface color (albedo), geometry (normals), materials, and lighting.

What's the problem?

Current video generation models are good at making videos *look* real, but they don't really 'understand' what's happening in the scene. They treat everything as just pixels, which makes it hard to edit videos in a way that respects how light and materials actually behave. Before this work, there was no single framework that could recover these underlying properties from a video, generate a video from them, and let users change those properties to edit the result.

What's the solution?

V-RGBX tackles this by first figuring out the intrinsic properties of a video – things like how reflective a surface is or its basic color – through a process called 'inverse rendering'. Then, it uses these properties to build the video. The key is that users can select specific moments (keyframes) and change these properties at those points, and V-RGBX intelligently fills in the gaps to make the changes look natural throughout the entire video. It uses a special 'interleaved conditioning mechanism' to make sure edits are physically realistic.
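To make this closed-loop workflow concrete, here is a minimal Python sketch of the three steps the paper describes: inverse rendering into intrinsic channels, a user edit at a keyframe, and re-synthesis that propagates the edit. The function names (`inverse_render`, `edit_keyframe`, `synthesize`), tensor shapes, and channel list are assumptions for illustration only, not V-RGBX's actual API; the model calls are stubbed out.

```python
import numpy as np

T, H, W = 16, 256, 256                        # frames, height, width (illustrative)
CHANNELS = ["albedo", "normal", "material", "irradiance"]

def inverse_render(video):
    """Step 1 (stub): decompose the RGB video into per-frame intrinsic channels."""
    return {name: np.zeros_like(video) for name in CHANNELS}

def edit_keyframe(intrinsics, channel, frame_idx, edit_fn):
    """Step 2: the user edits one intrinsic channel at a chosen keyframe."""
    intrinsics[channel][frame_idx] = edit_fn(intrinsics[channel][frame_idx])
    return intrinsics, {frame_idx: [channel]}

def synthesize(intrinsics, keyframe_edits):
    """Step 3 (stub): re-render the full video from the intrinsic channels,
    propagating the keyframe edits to every frame in a temporally consistent way."""
    return np.zeros((T, H, W, 3))

video = np.zeros((T, H, W, 3))                           # input RGB video
intrinsics = inverse_render(video)                       # RGB -> X (inverse rendering)
intrinsics, edits = edit_keyframe(                       # e.g. darken albedo at frame 0
    intrinsics, "albedo", 0, lambda albedo: albedo * 0.5)
edited_video = synthesize(intrinsics, edits)             # X -> RGB (re-synthesis)
```

The point of the sketch is the data flow: the user only touches intrinsic channels at selected keyframes, and the generative model is responsible for carrying those changes across the whole sequence.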

Why it matters?

This is important because it allows for much more precise and realistic video editing. Instead of just changing colors or adding effects, you can change the actual *material* of an object or the *lighting* in a scene, and the whole video updates accordingly. Potential applications include changing the appearance of objects in a video or relighting an entire scene to create a different mood, and the authors report that V-RGBX does this better than previous methods.

Abstract

Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
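The abstract highlights an "interleaved conditioning mechanism" without spelling out its layout, so the sketch below is only one plausible reading: conditioning tokens for edited intrinsic channels are interleaved with the video's per-frame tokens at the user-selected keyframes. The `Token` class, `build_conditioning` function, and the overall layout are hypothetical, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Token:
    frame_idx: int
    kind: str           # "rgb" or an intrinsic channel name such as "albedo"
    edited: bool = False

def build_conditioning(num_frames, keyframe_edits):
    """Interleave per-frame RGB tokens with intrinsic tokens at edited keyframes."""
    sequence = []
    for t in range(num_frames):
        sequence.append(Token(t, "rgb"))
        for channel in keyframe_edits.get(t, []):
            # Edited intrinsic channels sit next to their keyframe, so the
            # generator sees the edit in context and can propagate it.
            sequence.append(Token(t, channel, edited=True))
    return sequence

# Example: the user edits albedo at frame 0 and irradiance at frame 8.
conditioning = build_conditioning(16, {0: ["albedo"], 8: ["irradiance"]})
print([(tok.frame_idx, tok.kind) for tok in conditioning[:3]])
# [(0, 'rgb'), (0, 'albedo'), (1, 'rgb')]
```

Under this reading, any intrinsic modality can be injected at any keyframe, which matches the abstract's claim of flexible manipulation of any intrinsic channel; the real mechanism may differ in how the tokens are encoded and fused.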