StreamDiT: Real-Time Streaming Text-to-Video Generation

Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao

2025-07-08

Summary

This paper introduces StreamDiT, an AI model that generates video from text descriptions in real time. It combines transformer-based diffusion, flow matching, and adaptive layer normalization (adaLN) to produce smooth, continuous video at 16 frames per second.
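Flow matching, mentioned above, trains a model to predict the velocity that carries a noise sample toward a data sample along a straight interpolation path. The sketch below shows how one training pair could be built; the function name `flow_matching_pair` and the toy values are illustrative, not taken from the paper.

```python
def flow_matching_pair(x0, x1, t):
    """Build one flow-matching training pair: interpolate noise x0
    toward data x1 at time t, and return (x_t, v), where v = x1 - x0
    is the constant velocity the network learns to predict at x_t."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return x_t, v

# Toy example: a 4-dim "latent", noise = zeros, data = ones.
x_t, v = flow_matching_pair([0.0] * 4, [1.0] * 4, 0.5)
print(x_t)  # [0.5, 0.5, 0.5, 0.5] -- halfway between noise and data
print(v)    # [1.0, 1.0, 1.0, 1.0] -- the velocity target
```

At generation time, the model integrates this learned velocity field from noise to data, which is what makes few-step (distilled) sampling feasible.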

What's the problem?

Most current text-to-video models can only create short clips offline, and they are too slow for applications that need real-time or interactive video generation.

What's the solution?

The researchers designed StreamDiT to generate video as a stream: it divides a video into segments and trains with methods that improve both frame quality and temporal consistency across segments. A multistep distillation technique then reduces the number of denoising steps needed at generation time, letting StreamDiT run at real-time speeds on a single GPU.
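One way to picture this kind of streaming generation is a moving buffer in which frames coexist at staggered noise levels: each model pass denoises the whole buffer one step, the oldest frame becomes clean and is streamed out, and a fresh noisy frame enters. The sketch below is a toy illustration under that assumption, not the paper's implementation; `denoise_step` is a hypothetical placeholder for the model.

```python
from collections import deque

def denoise_step(frame, noise_level):
    # Hypothetical stand-in for one model denoising pass:
    # it simply moves the frame's noise level one step toward zero.
    return frame, max(noise_level - 1, 0)

def stream_frames(num_frames, buffer_size=4):
    """Toy moving-buffer streaming loop. The buffer holds frames at
    staggered noise levels; every iteration denoises each buffered
    frame once, emits any frame that reached noise level 0, and
    admits a fresh fully-noisy frame."""
    buf = deque()
    next_id = 0
    produced = 0
    while produced < num_frames:
        # Admit a fresh frame at the maximum noise level.
        if next_id < num_frames:
            buf.append((next_id, buffer_size))
            next_id += 1
        # One denoising pass over every frame in the buffer.
        buf = deque(denoise_step(f, n) for f, n in buf)
        # The oldest frame may now be clean: stream it out.
        if buf and buf[0][1] == 0:
            frame, _ = buf.popleft()
            yield frame
            produced += 1

print(list(stream_frames(6, buffer_size=3)))  # [0, 1, 2, 3, 4, 5]
```

After a short warm-up, the loop emits one clean frame per model pass, which is the property that makes unbounded real-time output possible; fewer denoising steps per frame (via distillation) is what makes each pass fast enough for 16 FPS.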

Why it matters?

This matters because StreamDiT makes it possible to create videos on the fly from text prompts, opening up new possibilities for interactive storytelling, virtual avatars, and live content creation where quick and smooth video generation is essential.

Abstract

StreamDiT is a streaming video generation model built on a transformer-based diffusion backbone (an adaLN DiT) trained with flow matching. With 4B parameters and multistep distillation, it achieves real-time performance at 16 FPS.