MultiCOIN: Multi-Modal COntrollable Video INbetweening

Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao

2025-10-14

Summary

This paper introduces a new system called MultiCOIN for creating smooth transitions between two video frames, a task often called 'inbetweening'. It aims to give users much more control over how these transitions look and behave.

What's the problem?

Current video inbetweening techniques struggle with large, complex, or detailed motions. They also don't let users easily specify exactly *how* a transition should happen, leading to results that don't match the creator's vision or contain errors such as misaligned objects. In short, existing methods aren't flexible or precise enough for creative video editing.

What's the solution?

The researchers developed MultiCOIN, which builds on the Diffusion Transformer (DiT), a video generation architecture known for producing high-quality long videos. The key is that MultiCOIN lets users control the inbetweening process in multiple ways: by specifying depth transitions, drawing motion trajectories, writing text prompts, or highlighting target regions where movement should happen. To make all these controls work together, the system translates them into a common sparse, point-based representation the model understands. The controls are also split into two branches, one guiding the *motion* and one guiding the *content* of the video, allowing for more nuanced results. Finally, a stage-wise training strategy helps the model learn to handle all these controls smoothly.
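The idea of mapping heterogeneous controls into one sparse point-based grid can be illustrated with a small sketch. This is a hypothetical re-implementation, not the paper's actual code: the function name, channel layout, and input formats (`trajectories` as lists of `(t, x, y)` points, `depth_points` as `(t, x, y, depth)` tuples) are all assumptions made for illustration.

```python
import numpy as np

def controls_to_sparse_points(trajectories, depth_points, num_frames, height, width):
    """Rasterize heterogeneous motion controls into one sparse point map.

    Hypothetical sketch of mapping user controls (drawn trajectories and
    sparse depth annotations) into a common point-based grid that could be
    concatenated with the video/noise input of a diffusion model.
    Channel 0 marks trajectory points; channel 1 stores a coarse depth
    value at annotated points.
    """
    grid = np.zeros((num_frames, height, width, 2), dtype=np.float32)
    for traj in trajectories:            # each traj: list of (frame, x, y)
        for t, x, y in traj:
            grid[t, y, x, 0] = 1.0       # presence of a motion-control point
    for t, x, y, d in depth_points:      # sparse depth annotations
        grid[t, y, x, 1] = d             # depth value, assumed in [0, 1]
    return grid

# Example: one two-point trajectory plus a depth hint on the first frame.
grid = controls_to_sparse_points(
    trajectories=[[(0, 2, 3), (1, 4, 5)]],
    depth_points=[(0, 2, 3, 0.5)],
    num_frames=4, height=8, width=8,
)
```

Because the grid is almost entirely zeros, it stays cheap to store and easy for a user to author, while still aligning spatially and temporally with the video latents.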

Why it matters?

This work is important because it makes video inbetweening much more versatile and user-friendly. It allows for more dynamic and customized video transitions, giving creators greater artistic control and the ability to create more visually compelling content. It moves beyond simple transitions to allow for complex and accurate visual storytelling.

Abstract

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
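The stage-wise training strategy mentioned in the abstract can be sketched as a schedule that enables control modalities progressively. The stage boundaries, modality order, and control-dropout probability below are illustrative assumptions, not details from the paper.

```python
import random

# Hypothetical stage schedule: controls are introduced gradually so the
# model first masters simple trajectory guidance before combining all
# modalities. Step counts and ordering are assumptions for illustration.
STAGES = [
    {"until_step": 10_000, "controls": ["trajectory"]},
    {"until_step": 20_000, "controls": ["trajectory", "depth"]},
    {"until_step": 30_000, "controls": ["trajectory", "depth", "text", "region"]},
]

def active_controls(step):
    """Return which control modalities are enabled at a given training step."""
    for stage in STAGES:
        if step < stage["until_step"]:
            return stage["controls"]
    return STAGES[-1]["controls"]

def sample_control_subset(step, drop_prob=0.3, rng=None):
    """Randomly drop controls so the model also learns to work with fewer cues."""
    rng = rng or random.Random(0)
    return [c for c in active_controls(step) if rng.random() > drop_prob]
```

Gradually widening the control set in this way is a common curriculum-style trick for multi-conditional diffusion models; the paper's actual schedule may differ.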