
Mind the Time: Temporally-Controlled Multi-Event Video Generation

Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov

2024-12-09

Summary

This paper introduces MinT, a system for generating videos that depict multiple events in a specified order and with precise timing, all from a single text input.

What's the problem?

Existing video generation models often struggle to create videos that show several events happening in the correct order. When given a single paragraph of text, these models may ignore some events or jumble their sequence, making the resulting video confusing or incomplete.

What's the solution?

MinT addresses this issue by binding each event to a specific time span in the video, so the model can focus on one event at a time and show the events in the right order. To support this, the authors developed ReRoPE, a time-based positional encoding that tells the model when each event should occur and guides the attention between event captions and video content. By fine-tuning a pre-trained video model on this kind of temporally grounded data, MinT generates coherent videos in which events are smoothly connected and accurately timed.
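To make "binding each event to a specific time span" concrete, here is a minimal sketch of what such an input could look like. The schema, the field names (`scene`, `caption`, `start`, `end`), and the helper `events_active_at` are hypothetical illustrations, not MinT's actual interface.

```python
# Hypothetical input format: a scene-level prompt plus event captions,
# each bound to a (start, end) time span in seconds. Field names are
# illustrative, not MinT's actual interface.
prompt = {
    "scene": "A corgi in a sunny backyard.",
    "events": [
        {"caption": "the corgi chases a ball", "start": 0.0, "end": 3.0},
        {"caption": "the corgi lies down in the grass", "start": 3.0, "end": 6.0},
        {"caption": "the corgi falls asleep", "start": 6.0, "end": 8.0},
    ],
}

def events_active_at(t: float) -> list[str]:
    """Return the captions whose time span covers timestamp t."""
    return [e["caption"] for e in prompt["events"] if e["start"] <= t < e["end"]]

print(events_active_at(4.5))  # ['the corgi lies down in the grass']
```

The point of this structure is simply that each caption comes with an explicit interval, so the generator knows which part of the video each event belongs to.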

Why it matters?

This research is important because it improves how AI can create videos that reflect complex sequences of events. By allowing for better control over timing, MinT can enhance applications in storytelling, education, and entertainment, making it easier for users to produce high-quality videos that convey their intended messages clearly.

Abstract

Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.
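The key mechanism described in the abstract is a time-based positional encoding that steers cross-attention between event captions and video tokens. Below is a minimal, hedged sketch of that intuition using standard rotary position embeddings driven by timestamps instead of token indices: video tokens carry their frame times, and each caption carries a representative time inside its assigned span (the midpoint here, which is an assumption). The paper's actual ReRoPE formulation differs in its details; this toy version only shows why time-based phases make each caption's attention concentrate on frames inside its own interval.

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    """Rotary-style positional encoding driven by scalar positions `pos`
    (here, timestamps in seconds rather than token indices)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = pos.unsqueeze(-1) * freqs                            # (..., half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * angles.cos() - x2 * angles.sin(),
                      x1 * angles.sin() + x2 * angles.cos()], dim=-1)

# Toy setup: 16 video frame tokens spanning 0-8 s, and two event captions
# bound to the intervals [0, 4] s and [4, 8] s.
frame_times = torch.linspace(0.0, 8.0, 16)
event_spans = torch.tensor([[0.0, 4.0], [4.0, 8.0]])

# Identical content vectors isolate the effect of the time-based encoding.
content = torch.ones(64)
q = rope_rotate(content.expand(16, 64), frame_times)              # video-token queries
k = rope_rotate(content.expand(2, 64), event_spans.mean(dim=-1))  # caption keys
# (using each span's midpoint is an illustrative choice, not the paper's rule)

# Rotary encodings make q.k depend on the *relative* time offset, so each
# frame's attention peaks at the event whose interval it falls in.
attn = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1)                 # (16 frames, 2 events)
print(attn.argmax(dim=-1))  # early frames favor event 0, later frames event 1
```

Because the content vectors are identical in this toy example, the attention pattern is determined entirely by the relative time offsets between frames and event spans, which is the property a time-aware encoding exploits to keep each event in its assigned period.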