LongCat-Video Technical Report

Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang

2025-10-28

Summary

This paper introduces LongCat-Video, a new video generation model designed to create long, high-quality videos efficiently. It is a significant step toward building 'world models': AI systems that understand and can simulate the real world.

What's the problem?

Creating realistic and lengthy videos with AI is really hard. Existing models often struggle to maintain quality and consistency over time, and generating long videos can take a very long time and require a lot of computing power. The goal is to build a model that can generate videos that are both high quality *and* long, without being incredibly slow or expensive.

What's the solution?

The researchers built LongCat-Video, a model with about 13.6 billion parameters based on the Diffusion Transformer (DiT) framework. A single model handles several tasks: creating videos from text descriptions, turning images into videos, and continuing existing videos. Pretraining it specifically on video continuation helps it maintain quality and temporal coherence over longer durations. To speed up inference, the model generates videos coarse-to-fine along both the temporal and spatial axes, starting with a low-resolution version and progressively adding detail, and it uses Block Sparse Attention to cut computation, especially at high resolutions. Finally, the researchers refined the model with multi-reward reinforcement learning from human feedback (RLHF) to improve overall quality.
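The coarse-to-fine idea can be sketched as a toy loop. This is a minimal illustration under assumed details: the real model runs a Diffusion Transformer denoiser at each stage, not the random stand-in used here, and its upsampling is learned rather than nearest-neighbour.

```python
import numpy as np

def fake_denoise(latent, steps):
    """Stand-in for a diffusion denoising pass; LongCat-Video runs a DiT here."""
    rng = np.random.default_rng(0)
    for _ in range(steps):
        latent = 0.9 * latent + 0.1 * rng.standard_normal(latent.shape)
    return latent

def upsample2x(latent):
    """Nearest-neighbour 2x spatial upsampling of a (frames, H, W) latent."""
    return latent.repeat(2, axis=1).repeat(2, axis=2)

def coarse_to_fine(frames=8, base_res=16, stages=2, steps_per_stage=4):
    """Generate coarse at low resolution, then upsample and refine."""
    latent = np.zeros((frames, base_res, base_res))
    for stage in range(stages):
        latent = fake_denoise(latent, steps_per_stage)
        if stage < stages - 1:
            latent = upsample2x(latent)  # add spatial detail for the next pass
    return latent

video = coarse_to_fine()
print(video.shape)  # (8, 32, 32) after one 2x upsample
```

The point of the staged loop is that most denoising steps run on small latents, so the expensive full-resolution passes are only needed at the end.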

Why it matters?

This work is important because it pushes the boundaries of what's possible with AI video generation. Being able to create long, coherent videos is crucial for building AI systems that can truly understand and interact with the world, like virtual assistants or simulations. The fact that the code and model are publicly available means other researchers can build upon this work and accelerate progress in the field.

Abstract

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include:

- Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model.
- Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos.
- Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions.
- Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models.

Code and model weights are publicly available to accelerate progress in the field.
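Block-sparse attention, mentioned in the abstract as the efficiency mechanism at high resolutions, can be illustrated with a toy block-selection sketch. The selection rule below (each query block keeps the top-k key blocks by mean similarity) is a generic block-sparse pattern, not necessarily the one used in the paper.

```python
import numpy as np

def block_sparse_mask(q, k, block=4, top_k=2):
    """Return a (num_q_blocks, num_k_blocks) boolean mask of blocks to attend to."""
    nq, nk = len(q) // block, len(k) // block
    qb = q.reshape(nq, block, -1).mean(axis=1)    # one summary vector per query block
    kb = k.reshape(nk, block, -1).mean(axis=1)    # one summary vector per key block
    scores = qb @ kb.T                            # block-level similarity
    keep = np.argsort(scores, axis=1)[:, -top_k:] # top-k key blocks per query block
    mask = np.zeros((nq, nk), dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))   # 16 query tokens, 8-dim features
k = rng.standard_normal((16, 8))   # 16 key tokens
mask = block_sparse_mask(q, k)
print(mask.sum(axis=1))  # each query block keeps exactly 2 of 4 key blocks
```

Attention is then computed only inside the kept blocks, so cost scales with the number of selected blocks rather than the full quadratic token grid, which matters most for long, high-resolution videos.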