VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu
2025-03-10
Summary
This paper introduces VideoPainter, a new AI system that can fix or edit videos by filling in missing or removed parts with realistic content that matches the rest of the video.
What's the problem?
Current methods for fixing videos (video inpainting) struggle with two main issues: they have trouble generating entirely new objects in fully masked areas, and they find it hard to balance keeping the background intact with generating new content in the foreground.
What's the solution?
The researchers created VideoPainter, which uses two separate parts working together. One part, a lightweight context encoder, looks at the masked (blank) areas of the video and figures out what should go there based on the surroundings. The other part, which can be any pre-trained video AI model, then fills in those areas. They also developed a technique that keeps results consistent even in very long videos, and built a dataset of over 390,000 video clips to train and test the system.
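The dual-stream idea can be sketched in a few lines of plain Python. This is a conceptual toy, not the authors' code: the layer functions, layer counts, feature sizes, and the additive fusion rule are all illustrative assumptions; in the real system both streams are diffusion-transformer (DiT) networks operating on video latents.

```python
def tiny_layer(x, weight):
    """Stand-in for one transformer block: a scale-and-shift on a feature vector."""
    return [weight * v + 0.1 for v in x]

def context_encoder(masked_video, n_layers=2):
    """Lightweight encoder (about 6% of the backbone's size in the paper).
    It processes the masked video and emits one feature per injected layer."""
    feats, x = [], masked_video
    for _ in range(n_layers):
        x = tiny_layer(x, weight=0.5)
        feats.append(x)
    return feats

def backbone(x, n_layers=4, context_feats=None):
    """Stand-in for the frozen pre-trained video backbone. If context
    features are supplied, inject them additively into the matching early
    layers; the backbone itself is never modified (plug-and-play)."""
    for i in range(n_layers):
        x = tiny_layer(x, weight=1.0)
        if context_feats is not None and i < len(context_feats):
            x = [a + b for a, b in zip(x, context_feats[i])]
    return x

masked = [0.0, 1.0, 2.0]                 # toy "latent" of a masked video
feats = context_encoder(masked)          # background context stream
out = backbone(masked, context_feats=feats)  # generation stream with injected context
```

The point of the separation is visible even in this toy: the backbone's weights are untouched, so any compatible pre-trained model can be slotted in, while all inpainting-specific learning lives in the small encoder.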
Why it matters?
This matters because it could make video editing much easier and more powerful. Imagine being able to remove unwanted objects from videos or add new things that look completely real. It could be used in movies, social media, or even to restore damaged old footage. The fact that it works on videos of any length and performs well across many measures of video quality makes it especially useful for real-world applications.
Abstract
Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.
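The "any-length" claim rests on processing long videos clip by clip while carrying identity information about the target region from one clip into the next. The sketch below illustrates only that control flow, under loose assumptions: the clip length, overlap, and the `inpaint_clip` stand-in (which just tags frames with the carried conditioning) are all hypothetical placeholders for the paper's target region ID resampling, not its actual mechanism.

```python
def inpaint_clip(frames, carry=None):
    """Placeholder for running the inpainting model on one clip.
    `carry` stands in for identity features resampled from the previous
    clip's target region; here it is just a string prefix."""
    prefix = carry if carry is not None else "init"
    return [f"{prefix}:{f}" for f in frames]

def inpaint_any_length(frames, clip_len=3, overlap=1):
    """Inpaint an arbitrarily long frame list clip by clip, conditioning
    each clip on identity info carried over from the previous one."""
    out, carry, i = [], None, 0
    while i < len(frames):
        clip = frames[i:i + clip_len]
        result = inpaint_clip(clip, carry)
        # Keep only the frames beyond the overlap (overlap frames were
        # already emitted by the previous clip; the last clip may be short).
        out.extend(result if i == 0 else result[overlap:])
        # "Resample" identity info from the clip's tail for the next clip.
        carry = result[-1]
        i += clip_len - overlap
    return out

frames = [f"f{i}" for i in range(7)]
result = inpaint_any_length(frames)
```

The design point is that memory stays bounded by the clip length no matter how long the video is, while the carried conditioning is what keeps the inpainted object's identity consistent across clip boundaries.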