The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
2025-03-07

Summary
This paper introduces LanDiff, a new AI system that combines two different methods to create high-quality videos from text descriptions, making the process both faster and more accurate.
What's the problem?
There are two main ways to generate videos from text: language models and diffusion models. Language models are good at understanding the meaning of text but often produce lower-quality visuals, while diffusion models create better visuals but struggle to understand complex ideas or relationships in the text.
What's the solution?
The researchers created LanDiff, which combines the strengths of both methods. It uses a tool called a semantic tokenizer to compress video data into a small set of discrete tokens that are easier to work with. A language model then generates the basic structure of the video from the text, and a diffusion model refines that coarse structure into a high-quality final video (a toy sketch of this two-stage flow follows below). This approach balances semantic understanding and visual quality.
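To make the two-stage idea concrete, here is a minimal, runnable sketch of a text-to-tokens-to-video pipeline. Everything in it (class names, token budgets, tensor shapes, the "denoising" loop) is a toy stand-in chosen for illustration, not LanDiff's actual architecture or code:

```python
# Toy sketch of a coarse-to-fine, LM-then-diffusion pipeline.
# All classes and numbers are illustrative stand-ins, not the paper's models.
import torch
import torch.nn as nn


class ToyTokenLM(nn.Module):
    """Stands in for stage 1: text prompt -> discrete semantic tokens."""

    def __init__(self, vocab_size: int = 1024, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, prompt_ids: torch.Tensor, steps: int = 16) -> torch.Tensor:
        ids = prompt_ids
        for _ in range(steps):  # greedy autoregressive decoding
            logits = self.head(self.embed(ids)).mean(dim=1)
            next_id = logits.argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
        return ids[:, prompt_ids.shape[1]:]  # keep only the generated tokens


class ToyDiffusionDecoder(nn.Module):
    """Stands in for stage 2: semantic tokens -> refined video frames."""

    def __init__(self, vocab_size: int = 1024, frames: int = 8, size: int = 32):
        super().__init__()
        self.cond = nn.Embedding(vocab_size, frames * 3 * size * size)
        self.frames, self.size = frames, size

    @torch.no_grad()
    def sample(self, tokens: torch.Tensor) -> torch.Tensor:
        # Start from noise and pull it toward the token-conditioned target;
        # a real diffusion model would run learned denoising steps instead.
        x = torch.randn(tokens.shape[0], self.frames, 3, self.size, self.size)
        cond = self.cond(tokens).mean(dim=1).view_as(x)
        for _ in range(4):
            x = 0.5 * x + 0.5 * cond
        return x


prompt_ids = torch.randint(0, 1024, (1, 8))     # pretend-tokenized text prompt
semantic = ToyTokenLM().generate(prompt_ids)    # stage 1: coarse semantic plan
video = ToyDiffusionDecoder().sample(semantic)  # stage 2: high-fidelity frames
print(video.shape)                              # (1, 8, 3, 32, 32)
```

The key design point the sketch preserves is the division of labor: the autoregressive model only has to get the high-level semantics right over a short token sequence, while the diffusion model handles pixel-level fidelity.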
Why it matters?
This matters because it improves how AI can turn written ideas into videos, which could be used for things like creating educational content, entertainment, or realistic simulations. LanDiff outperforms other models in both quality and efficiency, making it a significant step forward in text-to-video technology.
Abstract
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a ~14,000× compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Keling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
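For intuition about where a compression ratio of that magnitude can come from, here is a back-of-the-envelope calculation. The frame count, resolution, and token budget below are illustrative assumptions, not values stated in this summary:

```python
# Illustrative arithmetic for a ~14,000x compression ratio.
# Frame count, resolution, and token budget are assumed, not from the paper.
frames, height, width, channels = 96, 480, 720, 3
raw_values = frames * height * width * channels  # 99,532,800 raw pixel values
semantic_tokens = 7_000                          # assumed 1D token budget
print(raw_values / semantic_tokens)              # ~14,219x fewer values
```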