Loong: Generating Minute-level Long Videos with Autoregressive Language Models
Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu
2024-10-04

Summary
This paper introduces Loong, a new autoregressive LLM-based video generation model that can create minute-level long videos from text prompts.
What's the problem?
Generating long videos (on the scale of minutes) is difficult because most existing autoregressive models can only produce short clips of a few seconds. This limitation makes it hard to create content-rich videos that stay coherent and engaging over longer durations. Training for longer videos also raises specific challenges: the training loss becomes imbalanced between early and later video frames, and errors accumulate frame by frame during autoregressive generation.
What's the solution?
To overcome these challenges, the authors developed Loong, which models text tokens and video tokens as a single unified sequence and trains an autoregressive LLM on it from scratch. They use progressive short-to-long training, starting with short clips and gradually increasing the video length, combined with a loss re-weighting scheme so that long-video training is not dominated by the loss imbalance between frames. At inference time, they introduce strategies such as video token re-encoding and tailored sampling to reduce error accumulation, yielding smoother transitions and more coherent outputs. The model is trained on 10-second videos and then extended to generate minute-long videos conditioned on text prompts; a rough sketch of the training idea follows below.
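The snippet below is a minimal, illustrative sketch of what progressive short-to-long training with loss re-weighting could look like. It is not the authors' implementation: the toy model, vocabulary size, tokens-per-frame, stage schedule, and weighting values are all assumptions made for demonstration.

```python
# Hypothetical sketch of progressive short-to-long training with loss
# re-weighting over a unified [text | video] token sequence. All constants
# and the tiny stand-in model are assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1024              # assumed joint text+video token vocabulary size
TOKENS_PER_FRAME = 16     # assumed discrete tokens per video frame
TEXT_LEN = 8              # assumed text-prompt length in tokens

class TinyARModel(nn.Module):
    """Stand-in for the autoregressive LLM over the unified token sequence."""
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.rnn = nn.GRU(d, d, batch_first=True)   # placeholder for a causal transformer
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)

def reweighted_loss(logits, targets, frames, late_weight=0.5):
    """Per-token cross-entropy where tokens of later frames are down-weighted,
    so the many highly redundant late-frame tokens do not dominate long-clip
    training (the exact weighting scheme in the paper may differ)."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    w = torch.ones_like(ce)
    boundary = TEXT_LEN + (frames // 2) * TOKENS_PER_FRAME
    w[:, boundary:] = late_weight
    return (w * ce).sum() / w.sum()

model = TinyARModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Progressive short-to-long schedule: short clips first, then longer ones.
for frames, steps in [(2, 50), (4, 50), (8, 100)]:
    seq_len = TEXT_LEN + frames * TOKENS_PER_FRAME
    for _ in range(steps):
        seq = torch.randint(0, VOCAB, (4, seq_len))   # random stand-in data
        logits = model(seq[:, :-1])                   # next-token prediction
        loss = reweighted_loss(logits, seq[:, 1:], frames)
        opt.zero_grad(); loss.backward(); opt.step()
```

In this toy schedule, each stage reuses the same model and optimizer while the clip length grows, which mirrors the idea of first learning short-range appearance and motion before being exposed to longer sequences.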
Why it matters?
This research is significant because it pushes the boundaries of what video generation models can achieve, making it possible to create longer, more detailed videos from simple text descriptions. This advancement could benefit various applications, including filmmaking, education, and entertainment, where high-quality video content is essential.
Abstract
It is desirable but challenging to generate content-rich long videos on the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation is limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: https://epiphqny.github.io/Loong-video.
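Below is a hedged sketch of the inference-time extension idea mentioned in the abstract: generate one segment of video tokens, decode it to frames, then re-encode the tail of the decoded frames with the video tokenizer so that the next segment is conditioned on clean tokens rather than on possibly drifted generated ones. The names `model.generate`, `tokenizer.encode`, and `tokenizer.decode` are placeholders, not the released API.

```python
# Hypothetical outline of segment-by-segment long-video generation with
# video token re-encoding. The model/tokenizer interfaces are assumed.
def generate_long_video(model, tokenizer, text_tokens,
                        n_segments, frames_per_segment, overlap_frames):
    video_frames = []            # decoded pixel frames accumulated so far
    context = list(text_tokens)  # unified token sequence starts with the text

    for _ in range(n_segments):
        # 1) autoregressively sample tokens for the next video segment
        new_tokens = model.generate(context, max_new_frames=frames_per_segment)
        # 2) decode the sampled tokens back to pixel frames
        segment = tokenizer.decode(new_tokens)
        video_frames.extend(segment)
        # 3) token re-encoding: re-encode the last few decoded frames and use
        #    those tokens (plus the text) as the context for the next segment
        tail = video_frames[-overlap_frames:]
        context = list(text_tokens) + list(tokenizer.encode(tail))

    return video_frames
```

Passing the conditioning frames back through the tokenizer snaps them onto the tokenizer's codebook, which is one plausible way to limit error accumulation across segments; combining this with a suitable sampling strategy is how the abstract describes extending a model trained on 10-second clips to minute-level videos.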