Video Creation by Demonstration

Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu

2024-12-12

Summary

This paper introduces a new way to create videos called Video Creation by Demonstration. It allows users to generate realistic videos by providing a demonstration video and an image from a different scene, combining the actions shown in the demo with the new context.

What's the problem?

Creating videos that look natural and make sense can be challenging, especially when trying to combine actions from one video with a different background. Traditional methods often struggle to maintain realism and coherence when generating these videos, making it hard for users to achieve their desired results.

What's the solution?

The authors propose a method called delta-Diffusion, which uses a self-supervised training approach: it learns from unlabeled videos by predicting future frames, so no detailed annotations are needed. Instead of explicit control signals, delta-Diffusion conditions generation on an implicit latent extracted from the demonstration video, which gives it the flexibility and expressiveness to reproduce the demonstrated actions within the context of the provided image. An appearance bottleneck keeps this latent focused on the action itself, so little of the demonstration's visual appearance leaks into the generated video, allowing smooth, realistic continuations from the new context image.
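The idea can be illustrated with a toy sketch. Everything below is a simplified stand-in, not the paper's actual architecture: `extract_action_latent` plays the role of the video foundation model with an appearance bottleneck (here, just a random projection of frame differences), and `predict_next_frame` stands in for the diffusion generator, reduced to a linear map for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_action_latent(demo_frames, dim=8):
    """Hypothetical appearance bottleneck: compress frame-to-frame
    differences (the motion signal) into a small latent, discarding
    most appearance detail."""
    deltas = np.diff(demo_frames, axis=0)               # motion between frames
    flat = deltas.reshape(deltas.shape[0], -1)
    # Low-dimensional random projection as a crude bottleneck.
    proj = rng.standard_normal((flat.shape[1], dim)) / np.sqrt(flat.shape[1])
    return flat @ proj                                  # (T-1, dim) action latents

def predict_next_frame(context_frame, action_latent, weights):
    """Toy conditional predictor: next frame = current frame plus a
    learned function (a linear map here) of the implicit action latent."""
    delta = (action_latent @ weights).reshape(context_frame.shape)
    return context_frame + delta

# Toy data: an 8-frame "demonstration" video of 4x4 grayscale frames.
T, H, W = 8, 4, 4
demo = np.cumsum(rng.standard_normal((T, H, W)) * 0.1, axis=0)
context = rng.standard_normal((H, W))                   # image from a different scene

latents = extract_action_latent(demo)                   # implicit control signal
weights = rng.standard_normal((latents.shape[1], H * W)) * 0.01

# Self-supervised objective: predict the demo's own future frames...
pred = predict_next_frame(demo[0], latents[0], weights)
loss = np.mean((pred - demo[1]) ** 2)                   # future-frame prediction loss

# ...then reuse the same action latents to roll out from the new context.
rollout = [context]
for z in latents:
    rollout.append(predict_next_frame(rollout[-1], z, weights))
```

The point of the sketch is the separation of roles: the latents carry only *what happens* (motion), while the context frame supplies *where it happens* (appearance), which is why the same latents can drive a rollout from an unrelated scene.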

Why it matters?

This method is significant because it simplifies video creation, enabling people to generate high-quality videos more easily. It opens up new possibilities for applications in gaming, animation, and virtual reality by allowing users to visually demonstrate actions in various contexts without needing extensive technical skills.

Abstract

We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present delta-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, delta-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential towards interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.