Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective
Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang
2025-07-14
Summary
This paper introduces Lumos-1, an autoregressive video generation model that keeps the standard large language model (LLM) architecture with only minimal changes, producing videos token by token while remaining efficient and capable.
What's the problem?
Previous autoregressive video generators either departed from the standard LLM architecture, depended on bulky external components, or were too slow at generation time, which limited their practicality.
What's the solution?
The researchers keep the core LLM architecture intact and add MM-RoPE, a multimodal rotary position encoding that captures both spatial and temporal structure in video tokens, along with a training strategy called Autoregressive Discrete Diffusion Forcing (AR-DF) that balances the learning signal across video frames. Together, these changes let a largely unmodified language model handle video data effectively.
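To make the position-encoding idea concrete, here is a minimal sketch of a 3D rotary embedding in the spirit of MM-RoPE: the channel pairs of a feature vector are partitioned among temporal, height, and width axes, so a single rotary embedding encodes a token's spatiotemporal position. The `(2, 1, 1)` split ratio and the simple concatenation scheme are illustrative assumptions, not the paper's exact channel allocation.

```python
import math

def rope_angles(pos, dim, base=10000.0):
    # Standard RoPE: one rotation angle per channel pair, with
    # geometrically decaying frequencies across pairs.
    half = dim // 2
    return [pos * base ** (-i / half) for i in range(half)]

def mm_rope_angles(t, h, w, dim, split=(2, 1, 1), base=10000.0):
    # Sketch of a multimodal 3D RoPE: divide the channel pairs among the
    # temporal (t), height (h), and width (w) axes. The split ratio here
    # is a hypothetical choice for illustration.
    unit = (dim // 2) // sum(split)
    n_t, n_h, n_w = unit * split[0], unit * split[1], unit * split[2]
    return (rope_angles(t, 2 * n_t, base)
            + rope_angles(h, 2 * n_h, base)
            + rope_angles(w, 2 * n_w, base))

def apply_rope(x, angles):
    # Rotate each consecutive channel pair (x[2i], x[2i+1]) by angles[i].
    out = []
    for i, a in enumerate(angles):
        x1, x2 = x[2 * i], x[2 * i + 1]
        out += [x1 * math.cos(a) - x2 * math.sin(a),
                x1 * math.sin(a) + x2 * math.cos(a)]
    return out
```

Because each axis gets its own frequency bands, two tokens that differ only in time are rotated differently from two that differ only in spatial position, which is what lets attention scores reflect relative 3D offsets.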
Why it matters?
This matters because Lumos-1 generates high-quality videos more efficiently and flexibly than earlier models, and it supports multiple tasks, such as turning text or images into videos, within a single unified model.
Abstract
Lumos-1 is an autoregressive video generator built on a minimally modified LLM architecture; using MM-RoPE for spatiotemporal position encoding and AR-DF for balanced frame-wise training, it achieves competitive performance with modest computational resources.