VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue

2025-10-10

Summary

This paper introduces a new way to generate and edit videos that lets users specify exactly where and when content should appear, much like painting on a video canvas. It unifies several existing controllable video generation tasks, such as image-to-video, inpainting, extension, and interpolation, under a single approach.

What's the problem?

Current video generation models struggle with precise frame-level control because their causal VAEs compress several pixel frames into a single latent representation. As a result, there is no way to tell the model *exactly* which pixel frame a condition refers to: multiple moments in time map to the same latent, making conditioning temporally ambiguous when trying to edit or complete a video at a specific timestamp.

What's the solution?

The researchers developed a system called VideoCanvas that adapts a frozen, pretrained video diffusion model with zero new parameters, so no retraining is required. It decouples the two kinds of control: spatial placement (where things go) is handled by zero-padding the conditioning patches into the full frame, while temporal alignment is achieved with Temporal RoPE Interpolation, which assigns each condition a continuous, fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and tells the model *exactly* which pixel frame each condition belongs to, enabling frame-by-frame control.
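The fractional-position idea can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the stride value, function names, and dimensions here are assumptions, and only the 1-D temporal axis of RoPE is shown.

```python
import math

def fractional_latent_position(pixel_frame_idx, temporal_stride=4):
    # Causal VAEs typically compress `temporal_stride` pixel frames into
    # one latent frame, so integer latent indices cannot distinguish
    # frames that fall inside the same latent. Mapping the pixel-frame
    # index to a continuous latent position removes that ambiguity.
    return pixel_frame_idx / temporal_stride

def rope_angles(position, dim=8, base=10000.0):
    # Standard RoPE rotates each channel pair at its own frequency.
    # A fractional position simply interpolates the rotation angles
    # between the angles of neighboring integer positions.
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

# A condition placed at pixel frame 6 with stride 4 gets latent
# position 1.5, i.e. halfway between latent frames 1 and 2.
pos = fractional_latent_position(6)   # 1.5
angles = rope_angles(pos)
```

Because the rotation angles are linear in the position, a fractional position like 1.5 yields angles exactly halfway between those of positions 1 and 2, which is what lets a frozen backbone distinguish conditions that would otherwise collapse into the same latent index.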

Why it matters?

This work matters because it offers a more flexible and unified way to generate and edit videos: filling in missing parts, extending clips, interpolating between frames, or creating new content from arbitrary patches, all within a single system. The authors also introduce VideoCanvasBench, a new benchmark for measuring how well systems perform this kind of arbitrary spatio-temporal video completion.

Abstract

We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.