VOID: Video Object and Interaction Deletion
Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
2026-04-03
Summary
This paper introduces a new method, called VOID, for removing objects from videos in a way that looks realistic, even when the object interacts with other things in the scene.
What's the problem?
Current video editing tools are good at filling in the space where an object used to be and fixing things like shadows, but they struggle when the object being removed was actually *doing* something – like leaning on another object or causing something to move. This leads to videos where things don't make sense after the object is gone, because the interactions aren't corrected.
What's the solution?
The researchers built a system in which a vision-language model first figures out which parts of the scene were affected by the object being removed. Those regions are then used to guide a video diffusion model that fills them in with physically realistic content. To teach the model what realistic outcomes look like, they trained it on a new paired dataset built with computer graphics tools (Kubric and HUMOTO) that simulate how objects interact, where removing an object requires changing the downstream physical interactions it caused.
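The two-stage flow described above can be illustrated with a toy sketch. Everything here is a hypothetical placeholder: the real system uses a vision-language model to find affected regions and a video diffusion model to inpaint them, whereas this sketch substitutes a simple mask-dilation heuristic and a blanking step just to show how the stages connect.

```python
import numpy as np

def identify_affected_regions(removal_mask, pad=2):
    """Stand-in for the vision-language model: flag pixels near the
    removed object as potentially affected by its interactions.
    (Hypothetical dilation heuristic, not the paper's method.)"""
    affected = np.zeros_like(removal_mask)
    T, H, W = removal_mask.shape
    for t in range(T):
        for y, x in zip(*np.nonzero(removal_mask[t])):
            affected[t, max(0, y - pad):min(H, y + pad + 1),
                        max(0, x - pad):min(W, x + pad + 1)] = 1
    return affected

def inpaint_with_diffusion(video, guidance_mask):
    """Stand-in for the guided video diffusion model: here we simply
    zero out the guided regions; the real model would synthesize
    physically plausible replacement content."""
    out = video.copy()
    out[guidance_mask.astype(bool)] = 0.0
    return out

def remove_object(video, removal_mask):
    # Stage 1: identify regions affected by the removed object.
    affected = identify_affected_regions(removal_mask)
    # Stage 2: regenerate those regions conditioned on the rest of the video.
    return inpaint_with_diffusion(video, affected)
```

The key design point is the separation of concerns: a reasoning step decides *where* the scene must change, and a generative step decides *what* it should change to.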
Why does it matter?
This work is important because it pushes video editing AI beyond producing merely visually appealing results; it aims to make these tools understand the *physics* of the world. Edits that are physically plausible yield more believable and realistic videos, and they move AI closer to being a true simulator of real-world events.
Abstract
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.