Over++: Generative Video Compositing for Layer Interaction Effects

Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman

2025-12-23

Summary

This paper introduces a new technique for adding realistic environmental effects, like shadows and reflections, to videos. It focuses on making these effects look natural and blend seamlessly with the original footage, all while being controlled by simple text instructions.

What's the problem?

Currently, adding these kinds of effects to videos is done by hand, which takes a lot of time and skill. Existing video generation models struggle to add effects without altering the original footage, and video inpainting tools require detailed information about each frame, like precise outlines of objects, which is also time-consuming to produce. Both often yield effects that don't look quite right or aren't consistent throughout the video.

What's the solution?

The researchers developed a system called Over++ that automatically generates these environmental effects. It doesn't need to know anything about the camera's position, how the scene is moving, or the depth of objects in the video. They also created a special dataset to train the system and a clever way to improve its performance even with limited training data. You can even give it hints by roughly outlining where you want the effect, or specify key moments for changes.
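The name Over++ nods to the classic "over" operator used in layer compositing, where a semi-transparent foreground layer (such as a shadow or splash) is blended onto a background. As a rough sketch of that setting only (not the paper's method, which generates the effect layer with a video model), here is standard Porter-Duff "over" blending in NumPy; the function and array names are illustrative:

```python
import numpy as np

def over(fg_rgb, fg_alpha, bg_rgb):
    """Porter-Duff 'over': blend a semi-transparent layer onto a background.

    fg_rgb, bg_rgb: float arrays in [0, 1], shape (H, W, 3)
    fg_alpha:       float array in [0, 1], shape (H, W, 1)
    """
    return fg_alpha * fg_rgb + (1.0 - fg_alpha) * bg_rgb

# Toy example: a 50%-opaque black "shadow" layer darkens a white background.
bg = np.ones((2, 2, 3))                 # white background frame
shadow_rgb = np.zeros((2, 2, 3))        # black shadow color
shadow_alpha = np.full((2, 2, 1), 0.5)  # 50% opacity
out = over(shadow_rgb, shadow_alpha, bg)  # every pixel becomes 0.5 gray
```

The hard part the paper tackles is producing a plausible effect layer (its color and opacity over time) automatically from text and the input video, rather than having an artist paint it frame by frame.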

Why it matters?

This work is important because it could significantly speed up the video editing process for professionals. It allows artists to quickly and easily add complex environmental effects to their videos without needing to do everything manually, leading to more realistic and visually appealing results. It also opens the door to new creative possibilities by making it easier to experiment with different effects.

Abstract

In professional video compositing workflows, artists must manually create environmental interactions, such as shadows, reflections, dust, and splashes, between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.