
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

2024-08-09


Summary

This paper presents Puppet-Master, a video generation model that lets users animate individual parts of an object in a single image using simple drag inputs; the model then synthesizes a video showing the corresponding part-level motion.

What's the problem?

Creating realistic animations from images can be difficult because traditional methods often require a lot of data and can only animate entire objects rather than their individual parts. This limits the flexibility and creativity of animators and developers who want to create detailed and dynamic scenes.

What's the solution?

Puppet-Master addresses this issue by fine-tuning a large pre-trained video diffusion model with a new conditioning architecture that injects drag controls into the generation process. Users provide a single image and specify how individual parts should move by dragging them on screen, and the model generates a video depicting realistic motion of those parts. It also introduces an all-to-first attention mechanism, a drop-in replacement for standard spatial attention, which improves video quality by keeping object appearance and background consistent with the input image.
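The all-to-first idea can be pictured with a short sketch: in standard spatial attention each frame's tokens attend within that frame, whereas here the keys and values for every frame come from the first frame, anchoring appearance to the conditioning image. The code below is a minimal single-head PyTorch illustration under assumed tensor shapes and projection names (to_q, to_k, to_v, to_out); it is a sketch of the mechanism, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllToFirstAttention(nn.Module):
    """Single-head sketch: every frame's spatial tokens attend to the
    first frame's tokens instead of their own (hypothetical module name)."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, frames, tokens, dim) latent features of a video clip
        b, f, n, c = x.shape
        q = self.to_q(x)                               # queries from all frames
        first = x[:, :1]                               # first (conditioning) frame
        k = self.to_k(first).expand(-1, f, -1, -1)     # keys from the first frame only
        v = self.to_v(first).expand(-1, f, -1, -1)     # values from the first frame only
        out = F.scaled_dot_product_attention(q, k, v)  # (b, f, n, c)
        return self.to_out(out)

# Example: 2 clips, 8 frames, a 16x16 latent grid flattened to 256 tokens, 64 channels.
x = torch.randn(2, 8, 256, 64)
y = AllToFirstAttention(64)(x)   # same shape as x
```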

Why it matters?

This research is significant because it makes animation creation more intuitive and flexible: from a single image and a few drags, artists and developers can produce realistic part-level motion without rigging each object or collecting per-object training footage. By enabling detailed control over individual parts of a scene, Puppet-Master can enhance applications in gaming, virtual reality, and film production.

Abstract

We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for which we propose a new conditioning architecture to inject the dragging control effectively. More importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely adopted spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models. Unlike other motion-conditioned video generators that are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful motion trajectories. Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark. See our project page for more results: vgg-puppetmaster.github.io.
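The abstract describes conditioning generation on a sparse set of drags (motion trajectories anchored at points on the image). As one purely illustrative way to think about such conditioning, the sketch below rasterizes each drag into dense per-frame maps that a diffusion model could consume as extra input channels. The linear interpolation between start and end points, the two-channel layout, and the function name rasterize_drags are assumptions made for illustration, not the paper's actual conditioning architecture.

```python
import torch

def rasterize_drags(drags, num_frames, height, width):
    """Illustrative encoding of sparse drags into dense conditioning maps.

    drags: list of ((sx, sy), (ex, ey)) pairs in normalized [0, 1] coordinates.
    Returns a (num_frames, 2, H, W) tensor where channel 0 marks each drag's
    source pixel and channel 1 marks its linearly interpolated position at
    every frame. This layout is an assumption, not the paper's design.
    """
    cond = torch.zeros(num_frames, 2, height, width)
    for (sx, sy), (ex, ey) in drags:
        for t in range(num_frames):
            alpha = t / max(num_frames - 1, 1)
            # current drag position, linearly interpolated from start to end
            cx = sx + alpha * (ex - sx)
            cy = sy + alpha * (ey - sy)
            cond[t, 0, int(sy * (height - 1)), int(sx * (width - 1))] = 1.0
            cond[t, 1, int(cy * (height - 1)), int(cx * (width - 1))] = 1.0
    return cond

# Example: one drag from the image center toward the right edge over 16 frames.
maps = rasterize_drags([((0.5, 0.5), (0.9, 0.5))], num_frames=16, height=64, width=64)
```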