MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao

2025-10-22

Summary

This paper focuses on making it possible to generate longer videos with Diffusion Transformers, a type of artificial intelligence model that is very good at creating realistic images and videos.

What's the problem?

Creating long videos with these AI models is difficult because of something called 'attention'. Attention lets the AI relate different parts of the video to each other, but the amount of computation it needs grows with the square of the video's length, so it quickly becomes very slow. Existing methods that try to speed this up estimate what matters over large chunks of the video at once, which is too coarse to capture the important connections between individual parts.
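To see why this is a problem, it helps to count the work full attention does: every token (a small piece of the video) is compared against every other token, so the number of comparisons is the sequence length squared. The sketch below uses the roughly 580k-token context the paper reports for a minute-long video; the helper name is ours, for illustration only.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: n^2 comparisons."""
    return n_tokens ** 2

# A minute-long video at the paper's settings corresponds to roughly
# 580,000 tokens, so full attention would score about 3.4e11 pairs.
pairs = attention_pairs(580_000)

# Quadratic scaling: doubling the video length quadruples the work.
assert attention_pairs(2 * 580_000) == 4 * pairs
```

This quadratic blow-up is exactly what sparse-attention methods like MoGA try to avoid.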

What's the solution?

The researchers came up with a new method called Mixture-of-Groups Attention, or MoGA. Instead of estimating importance over large chunks, MoGA uses a small, learnable 'router' to figure out which parts of the video are most related to each other and focuses attention on those connections specifically. It's like highlighting the most relevant sentences in a book instead of reading every single word. Because MoGA doesn't need custom GPU kernels, it also combines smoothly with existing speed-up techniques such as FlashAttention and sequence parallelism.
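The core idea can be sketched in a few lines: a lightweight router assigns each token to one of a small number of groups, and attention is then computed only within each group, so the cost depends on group size rather than total length. This is a simplified illustration, not the paper's implementation; the routing weights `w_router` and routing by query states are our assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moga_attention(q, k, v, w_router, num_groups):
    """Toy mixture-of-groups attention (illustrative, not the paper's code).

    A lightweight linear router scores each token against `num_groups`
    groups; tokens then attend only within their assigned group, so the
    cost scales with (group size)^2 instead of (sequence length)^2.
    """
    n, d = q.shape
    group_ids = (q @ w_router).argmax(-1)  # hard assignment per token
    out = np.zeros_like(v)
    for g in range(num_groups):
        idx = np.where(group_ids == g)[0]
        if idx.size == 0:
            continue  # empty group: nothing to attend over
        qg, kg, vg = q[idx], k[idx], v[idx]
        attn = softmax(qg @ kg.T / np.sqrt(d))  # attention within the group
        out[idx] = attn @ vg
    return out
```

Because the router is learned, semantically related tokens (say, the same object across distant frames) can land in the same group, which is how long-range interactions survive despite the sparsity.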

Why it matters?

This research is important because it allows AI to generate much longer, higher-quality videos, up to a minute long at 480p and 24 frames per second, without needing massive amounts of computing power. This opens the door to creating more realistic and complex videos for entertainment, education, and other applications.

Abstract

Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.