ROSE: Remove Objects with Side Effects in Videos
Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao
2025-08-29

Summary
This paper introduces a new method, called ROSE, for removing objects from videos while also getting rid of the visual clues they leave behind, like shadows and reflections. The goal is to make it look as if the object was never there in the first place, a tricky problem for current video editing technology.
What's the problem?
Existing video editing tools are really good at removing objects themselves, but they struggle with the 'side effects' those objects create in the video. Think about a person casting a shadow, or their reflection in a window. If you just remove the person, the shadow and reflection still remain, which looks unnatural. The main issue is that there isn't much video footage available where objects *and* their side effects are perfectly recorded alongside a version *without* the object, making it hard to train these tools.
What's the solution?
The researchers tackled this problem by creating their own training data using a 3D rendering engine. This allowed them to generate tons of videos with objects and their shadows, reflections, and other effects, along with corresponding videos *without* the objects. They then built a video editing model, based on a technique called diffusion transformers, that not only removes the object but also predicts and erases the areas affected by its side effects. The model looks at the entire video to understand what needs to be erased and uses the synthetic data to learn how to do it well (a sketch of how that paired data can provide supervision is shown below). They also created a new set of videos, ROSE-Bench, to specifically test how well their method handles these tricky side effects.
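To make the supervision idea concrete, here is a minimal sketch of how a side-effect mask could be derived from a rendered pair of videos (one with the object, one without). The function name and threshold are illustrative assumptions, not the paper's actual code; it only assumes the paired clips are pixel-aligned float arrays from the same camera trajectory.

```python
import numpy as np

def differential_mask(video_with_obj: np.ndarray,
                      video_without_obj: np.ndarray,
                      threshold: float = 0.05) -> np.ndarray:
    """Per-frame binary mask of pixels that change when the object is removed.

    Both inputs are float arrays of shape (T, H, W, 3) in [0, 1], rendered
    with and without the target object. The thresholded absolute difference
    covers the object itself plus its side effects (shadows, reflections,
    cast light), and can serve as the supervision target for predicting the
    affected areas.
    """
    diff = np.abs(video_with_obj - video_without_obj).mean(axis=-1)  # (T, H, W)
    return (diff > threshold).astype(np.float32)
```

Because the data is synthetic, the two clips differ only where the object and its effects appear, so a simple per-pixel difference is enough to reveal the regions the model must learn to erase.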
Why it matters?
This work is important because it significantly improves the realism of video object removal. By addressing the often-overlooked issue of side effects, ROSE makes it possible to edit videos in a way that looks much more natural and believable. This has applications in filmmaking, video editing for social media, and potentially even restoring old or damaged footage.
Abstract
Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies an object's effects on its environment, which can be categorized into five common cases: shadows, reflections, light, translucency, and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance on various kinds of side-effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
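As a rough illustration of the reference-based erasing setup described in the abstract, the sketch below assembles typical conditioning inputs for a diffusion-based video inpainting model: the latents of the full clip, the removal mask, and the noisy latent are combined so the model can reference the whole video while filling the masked region. The function name, tensor shapes, and channel-concatenation scheme are assumptions for illustration; ROSE's actual architecture and conditioning may differ.

```python
import torch

def build_inpainting_condition(latent_video: torch.Tensor,
                               object_mask: torch.Tensor,
                               noisy_latent: torch.Tensor) -> torch.Tensor:
    """Sketch of a common diffusion-inpainting conditioning scheme.

    latent_video: (B, T, C, H, W) encoded frames of the entire input video,
                  so the model sees the whole clip when erasing.
    object_mask:  (B, T, 1, H, W) binary mask of the region to remove
                  (1 = erase, including predicted side-effect areas).
    noisy_latent: (B, T, C, H, W) noisy latent at the current diffusion step.

    The masked latent and the mask itself are concatenated with the noisy
    latent along the channel axis and fed to the diffusion transformer.
    """
    masked_latent = latent_video * (1.0 - object_mask)
    return torch.cat([noisy_latent, masked_latent, object_mask], dim=2)
```

The key design point the paper emphasizes is that the model conditions on the entire video rather than isolated frames, which lets it track object-correlated regions such as moving shadows and reflections across time.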