
Mimir: Improving Video Diffusion Models for Precise Text Understanding

Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang

2024-12-05

Summary

This paper presents Mimir, a new framework that improves text-to-video generation by making the model understand the input text more precisely.

What's the problem?

Current text-to-video diffusion models often struggle to fully understand their prompts because they condition on features from text encoders that capture only part of the nuance of language. This limitation makes it hard for these models to produce high-quality videos that faithfully reflect the text, especially when descriptions are complex or demand detailed understanding.

What's the solution?

Mimir addresses this by integrating large language models (LLMs) into the video generation pipeline. A dedicated component called a token fuser harmonizes the outputs of the traditional text encoder and the LLM, letting the model combine the strengths of both: the video priors learned with encoder features and the richer text comprehension of the LLM. As a result, Mimir generates high-quality videos that align closely with even short captions and handles shifting motions effectively. A rough sketch of one plausible fuser design follows below.
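The paper summary does not spell out the fuser's internals, but projecting LLM hidden states into the text encoder's feature space and merging them through a gated residual is a common way to bridge this kind of distribution gap. Below is a minimal PyTorch sketch of one plausible design; the class name, the projection-plus-cross-attention layout, and the zero-initialized gate are illustrative assumptions, not Mimir's published architecture.

```python
import torch
import torch.nn as nn

class TokenFuser(nn.Module):
    """Illustrative token fuser (NOT Mimir's published design):
    projects LLM hidden states into the text-encoder feature space and
    merges the two token streams with gated cross-attention."""

    def __init__(self, enc_dim: int, llm_dim: int, n_heads: int = 8):
        super().__init__()
        # Map LLM features into the encoder's embedding space.
        self.proj = nn.Linear(llm_dim, enc_dim)
        # Encoder tokens attend to the projected LLM tokens.
        self.attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(enc_dim)
        # Zero-initialized gate: at step 0 the fuser is an identity map,
        # so the pretrained T2V model's video priors are preserved.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, enc_tokens: torch.Tensor, llm_tokens: torch.Tensor) -> torch.Tensor:
        # enc_tokens: (B, N_enc, enc_dim); llm_tokens: (B, N_llm, llm_dim)
        llm_feats = self.proj(llm_tokens)
        fused, _ = self.attn(self.norm(enc_tokens), llm_feats, llm_feats)
        # Gated residual blends LLM-informed features into the encoder stream.
        return enc_tokens + torch.tanh(self.gate) * fused
```

The zero-initialized gate echoes a common trick for grafting new conditioning branches onto pretrained diffusion models: training starts from the encoder-only baseline, and the LLM branch's contribution grows only as far as it actually helps.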

Why it matters?

This research matters because it makes AI video generation follow text more faithfully, which is useful for applications like filmmaking, animation, and content creation. Better text understanding lets these models produce more accurate and engaging videos, benefiting industries that rely on visual storytelling.

Abstract

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offer three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: https://lucaria-academy.github.io/Mimir/
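To make the "feature distribution gap" concrete, here is a small sketch of the two text-feature streams a fuser like Mimir's has to reconcile: contextual embeddings from a bidirectional encoder versus hidden states from a decoder-only (causal) LLM. The specific checkpoints (t5-base, gpt2) are stand-ins chosen for illustration; the paper does not name the models it uses here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, T5EncoderModel

prompt = "a corgi surfing a wave at sunset"

# Stream 1: bidirectional encoder features, the conventional T2V conditioning.
t5_tok = AutoTokenizer.from_pretrained("t5-base")
t5 = T5EncoderModel.from_pretrained("t5-base").eval()
with torch.no_grad():
    enc_feats = t5(**t5_tok(prompt, return_tensors="pt")).last_hidden_state  # (1, N, 768)

# Stream 2: decoder-only (causal) LLM hidden states.
llm_tok = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
with torch.no_grad():
    llm_feats = llm(**llm_tok(prompt, return_tensors="pt")).hidden_states[-1]  # (1, M, 768)

# Same nominal width here, but the features come from different training
# objectives and occupy different distributions -- the gap the token fuser bridges.
print(enc_feats.shape, llm_feats.shape)
```

Even when the dimensions happen to match, the two streams come from different objectives (span-corruption denoising for T5 versus next-token prediction for a causal LM), which is one reason the abstract says the gap hinders plugging LLMs directly into established T2V models.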