InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo
2025-12-29
Summary
This paper introduces InsertAnywhere, a new system for realistically inserting objects into existing videos.
What's the problem?
Inserting objects into videos convincingly is hard because a system has to understand how the scene changes over time and make the new object look like it truly belongs there, especially when things in the scene move in front of it and block it from view (occlusion). Existing methods often produce insertions that don't quite match the scene's geometry or lighting.
What's the solution?
InsertAnywhere tackles this by first building a detailed understanding of the video's 3D structure and how it changes over time. It uses this understanding to place the new object where it makes geometric sense and to keep its position and visibility consistent from frame to frame, even when other things pass in front of it. Finally, a powerful diffusion-based video generation model blends the object into the video, adjusting lighting and shadows to match the surrounding scene. To train the system, the researchers built a new dataset that pairs each video with an object-removed version and a reference image of the inserted object.
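To make the occlusion idea concrete, here is a minimal sketch (not the authors' implementation) of depth-based visibility masking: given the scene depth recovered from the reconstruction and the depth of the rendered inserted object in one frame, the object is kept only where it sits in front of the scene. The function and array names are illustrative.

```python
import numpy as np

def visibility_mask(scene_depth: np.ndarray,
                    object_depth: np.ndarray,
                    object_mask: np.ndarray) -> np.ndarray:
    """Per-frame occlusion test (z-buffer style, illustrative only).

    scene_depth : HxW depth of the reconstructed background scene
    object_depth: HxW depth of the rendered inserted object
    object_mask : HxW boolean mask of pixels the object covers
    Returns a boolean mask of pixels where the object is actually visible,
    i.e. not occluded by scene geometry closer to the camera.
    """
    return object_mask & (object_depth < scene_depth)

# Toy example: a 2x2 frame where the scene occludes one object pixel.
scene = np.array([[2.0, 2.0], [0.5, 2.0]])
obj   = np.array([[1.0, 1.0], [1.0, 1.0]])
mask  = np.array([[True, True], [True, False]])
print(visibility_mask(scene, obj, mask))
# [[ True  True]
#  [False False]]
```

Applying such a test per frame, using the time-varying scene depth, is one simple way to keep the inserted object's visibility consistent as things move in front of it.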
Why it matters?
This research is important because it makes video object insertion significantly more realistic. It has potential applications in movie special effects, personalized video creation, and virtual reality experiences where objects are added to existing footage.
Abstract
Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.
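As a rough illustration of how the ROSE++ triplets described above could be organized for supervised training, here is a minimal data-structure sketch; the field names and directory layout are assumptions for illustration, not the released dataset format.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class RosePlusPlusTriplet:
    """One ROSE++ training sample (hypothetical layout).

    background_video: the object-removed clip (model input)
    target_video:     the object-present clip (supervision target)
    reference_image:  VLM-generated reference image of the inserted object
    """
    background_video: Path
    target_video: Path
    reference_image: Path

def load_triplets(root: Path) -> list[RosePlusPlusTriplet]:
    """Collect triplets assuming one sub-directory per sample (hypothetical)."""
    triplets = []
    for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        triplets.append(RosePlusPlusTriplet(
            background_video=sample_dir / "object_removed.mp4",
            target_video=sample_dir / "object_present.mp4",
            reference_image=sample_dir / "reference.png",
        ))
    return triplets
```

Under this kind of layout, the object-removed clip and reference image would serve as conditioning inputs while the object-present clip supervises the synthesized output.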