Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, Steven M. Seitz
2024-08-28

Summary
This paper presents a method called Generative Inbetweening, which adapts image-to-video models to create smooth video sequences by filling in the motion between two key frames.
What's the problem?
Creating videos that show smooth motion between two still images (key frames) is challenging. Traditional frame interpolation methods often struggle to produce realistic and coherent movement, especially when the two frames are far apart, leading to choppy or unnatural video transitions.
What's the solution?
The authors adapted a large-scale image-to-video diffusion model, originally designed to generate a video moving forward in time from a single image, to work for key frame interpolation. They lightly fine-tuned the model so that it instead predicts video frames moving backwards in time from a single input image. By running the original forward-time model from the first key frame and the fine-tuned backward-time model from the second key frame in a dual-directional diffusion sampling process, and fusing their overlapping estimates, they were able to generate smoother and more coherent video sequences.
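To make the dual-directional idea concrete, here is a minimal sketch of how such a sampling loop could look. The names (`forward_model`, `backward_model`, `scheduler`) and the simple averaging rule are assumptions for illustration, not the authors' actual code or fusion rule.

```python
import torch

def sample_inbetween(forward_model, backward_model, scheduler,
                     keyframe_start, keyframe_end,
                     num_frames=25, device="cuda"):
    # Hypothetical dual-directional diffusion sampling sketch.
    # forward_model  : pretrained image-to-video denoiser conditioned on the first key frame
    # backward_model : fine-tuned denoiser that predicts video moving backwards in time,
    #                  conditioned on the second key frame
    # scheduler      : a generic diffusion scheduler exposing `timesteps` and `step`

    # Start from pure noise for the whole clip: (frames, channels, height, width).
    latents = torch.randn(num_frames, 4, 64, 64, device=device)

    for t in scheduler.timesteps:
        # Forward-time estimate: denoise the clip as a video starting at keyframe_start.
        eps_fwd = forward_model(latents, t, cond=keyframe_start)

        # Backward-time estimate: reverse the clip so it "starts" at keyframe_end,
        # denoise with the backward model, then flip the result back into forward order.
        eps_bwd = backward_model(latents.flip(0), t, cond=keyframe_end).flip(0)

        # Fuse the two overlapping estimates (plain average used here as a stand-in
        # for the paper's fusion strategy).
        eps = 0.5 * (eps_fwd + eps_bwd)

        # One denoising step with the combined estimate.
        latents = scheduler.step(eps, t, latents)

    return latents
```

A real implementation would also need to anchor the first and last frames to the encoded key frames and use the underlying video model's actual conditioning interface; this sketch only illustrates how estimates from the two time directions can be combined at each denoising step.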
Why does it matter?
This research is significant because it improves how we can create videos from simple images, making it easier for filmmakers and animators to produce high-quality content. By enhancing the technology for generating smooth transitions, it opens up new possibilities for creative storytelling in films and animations.
Abstract
We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.