
Mobius: Text to Seamless Looping Video Generation via Latent Shift

Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, Bin Xiao

2025-02-28

Summary

This paper introduces Mobius, a new way to create seamlessly looping videos from text descriptions using AI. It's like having a magic video maker that can turn your words into endless, smooth animations without needing any extra images or annotations.

What's the problem?

Creating looping videos that look smooth and natural is hard, especially when you want to make them just from a text description. Current methods often need extra information or images to work well, which limits what kind of videos they can make.

What's the solution?

The researchers created Mobius, which uses a technique called latent shifting. It takes a pre-trained video diffusion model and repurposes it to create perfect loops, without any additional training. Mobius connects the starting and ending noise of the video into a cycle, then gradually shifts the latents around that cycle at each denoising step, so the model's temporal context wraps around and everything flows smoothly. This method can make loops of any length and doesn't need any extra images to work.
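The shifting idea can be sketched in a few lines. This is a minimal, hypothetical illustration using NumPy, not the paper's implementation: `denoise_step` is a stand-in for one step of a real pre-trained video diffusion model, and the shapes are toy values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latents):
    """Stand-in for one denoising step of a video diffusion model.

    A real implementation would call the pre-trained model here;
    this placeholder just shrinks the noise so the sketch runs.
    """
    return latents * 0.9

# Toy "latent cycle": a ring of per-frame latents. Because the
# sequence is treated as circular, the start and end connect.
num_frames, latent_dim, num_steps = 8, 4, 5
latents = rng.standard_normal((num_frames, latent_dim))

for _ in range(num_steps):
    # Latent shift: rotate the first-frame latent to the end, so the
    # denoising context changes every step while staying consistent
    # around the loop.
    latents = np.roll(latents, shift=-1, axis=0)
    latents = denoise_step(latents)

print(latents.shape)  # (8, 4)
```

Because the rotation wraps around, every frame eventually sits at every position in the context window, which is what keeps the first and last frames consistent with each other.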

Why it matters?

This matters because it opens up new possibilities for creating visual content easily. Imagine being able to type a description and get a perfect looping video for social media, presentations, or digital art. It could save time for creators, make it easier to explain ideas visually, and even help in fields like education or advertising where engaging, looping visuals are valuable. Plus, since it works just from text, it's more flexible and creative than methods that need existing images or videos to start with.

Abstract

We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations, thereby creating new visual materials for the multi-media presentation. Our method repurposes the pre-trained video latent diffusion model for generating looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the videos. Given that the temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end in each step. As a result, the denoising context varies in each step while maintaining consistency throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shifting approach to generate seamless looping videos beyond the scope of the video diffusion model's context. Unlike previous cinemagraphs, the proposed method does not require an image as appearance, which will restrict the motions of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method, demonstrating its efficacy in different scenarios. All the code will be made available.