
DiffuEraser: A Diffusion Model for Video Inpainting

Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo

2025-01-24


Summary

This paper introduces DiffuEraser, a new AI model that can fill in missing parts of videos with realistic, consistent content. It's like having a super-smart digital artist that can seamlessly patch up holes in a video so it looks like nothing was ever missing.

What's the problem?

Current methods for fixing videos with missing parts (called video inpainting) often struggle when there are large areas to fill in. They can make the video look blurry or inconsistent over time, especially when trying to add back big chunks of missing content. It's like trying to patch a big hole in a moving picture, but the patch keeps shifting or doesn't quite match the rest of the video.

What's the solution?

The researchers created DiffuEraser, which is built on a generative AI technique called stable diffusion. This lets it create more detailed and coherent content to fill in the missing regions. They also added two key ideas: feeding the model prior information to guide it (like giving an artist a rough sketch before they start painting), and having the model look at longer stretches of the video so the filled-in content stays consistent over time. In effect, the AI learns to understand the whole flow of the video rather than just individual frames, so it can fill in the gaps more naturally.
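
To make this more concrete, here is a minimal, hypothetical sketch of a diffusion-style inpainting step where a prior completion (for example, from a propagation-based inpainter) is used both as the initialization and as weak conditioning. This is not the authors' code: the tiny denoiser module, the update rule, and the tensor shapes are illustrative stand-ins.

# Minimal sketch (not the authors' implementation): diffusion-based video
# inpainting where a prior completion initializes the latents and acts as
# weak conditioning. All names and shapes here are illustrative stand-ins.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the diffusion UNet: predicts noise from latents + condition."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv3d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, latents, condition):
        # Concatenate the weak condition (prior latents) along the channel axis.
        return self.net(torch.cat([latents, condition], dim=1))

def inpaint_clip(prior_latents, mask, steps=10):
    """
    prior_latents: (B, C, T, H, W) latents of the prior completion, used both
                   as initialization and as weak conditioning.
    mask:          (B, 1, T, H, W) with 1 where content must be generated.
    """
    denoiser = TinyDenoiser(prior_latents.shape[1])
    # Initialize from the prior plus noise instead of pure noise -- the
    # "initialization" role of the prior described in the paper.
    latents = prior_latents + torch.randn_like(prior_latents)
    for _ in range(steps):
        noise_pred = denoiser(latents, prior_latents)   # weak conditioning
        latents = latents - noise_pred / steps           # toy update rule
        # Keep known (unmasked) regions pinned to the prior at every step.
        latents = mask * latents + (1 - mask) * prior_latents
    return latents

if __name__ == "__main__":
    B, C, T, H, W = 1, 4, 8, 32, 32
    out = inpaint_clip(torch.randn(B, C, T, H, W), torch.rand(B, 1, T, H, W).round())
    print(out.shape)

The point the sketch tries to show is that starting from the prior rather than from pure noise, and feeding the prior back as a condition at every step, keeps the generated content anchored to something plausible, which is how the paper describes mitigating noisy artifacts and suppressing hallucinations.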

Why it matters?

This matters because it could make video editing and restoration much better and easier. Imagine being able to remove unwanted objects from videos seamlessly, or restore damaged old films to their former glory. It could be used in movie production, video game development, or even in preserving historical footage. By making the process of fixing videos more effective and efficient, DiffuEraser opens up new possibilities for how we can work with and improve video content.

Abstract

Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater details and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.
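
On the long-sequence side, one common way to keep a long video consistent is to run inference over overlapping temporal windows and blend the overlaps. The sketch below illustrates that pattern as an assumption; the paper's actual mechanism (expanded temporal receptive fields plus the temporal smoothing property of Video Diffusion Models) may differ in its details.

# Hypothetical sketch of long-sequence inference: process the video in
# overlapping temporal windows and blend the overlaps so neighboring clips
# stay consistent. This is an assumed pattern, not the paper's exact method.
import torch

def infer_long_video(frames, inpaint_clip_fn, clip_len=16, overlap=4):
    """frames: (T, C, H, W); inpaint_clip_fn maps a clip to an inpainted clip."""
    T = frames.shape[0]
    out = torch.zeros_like(frames)
    weight = torch.zeros(T, 1, 1, 1)
    start = 0
    while start < T:
        end = min(start + clip_len, T)
        clip_out = inpaint_clip_fn(frames[start:end])
        # Linear ramp on the overlap so neighboring windows blend smoothly.
        w = torch.ones(end - start, 1, 1, 1)
        ramp = torch.linspace(0.1, 1.0, steps=min(overlap, end - start))
        w[: ramp.shape[0], 0, 0, 0] = ramp
        out[start:end] += clip_out * w
        weight[start:end] += w
        if end == T:
            break
        start = end - overlap
    return out / weight.clamp(min=1e-6)

if __name__ == "__main__":
    video = torch.randn(40, 3, 64, 64)
    result = infer_long_video(video, lambda clip: clip)  # identity stand-in
    print(result.shape)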