MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng

2025-10-22

Summary

This paper focuses on creating better ways to train computer programs that generate videos from text descriptions, like turning a sentence into a short clip.

What's the problem?

Training these video-generating programs is really hard because it takes a lot of computing power and data. It's difficult to get the video to accurately match the text, handle the long sequences of images that make up a video, and understand how things change over time within the video itself. Basically, it's a complex task that demands a lot of resources.

What's the solution?

The researchers developed a complete system for training these models, focusing on four key areas: how the data is prepared, the design of the model itself, the strategy used during training, and the computer infrastructure needed to handle the workload. They made improvements to each of these areas, resulting in a more efficient and higher-performing video generator called MUG-V 10B. They also made all their tools and the model itself freely available to others.
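One of the training-strategy ideas mentioned in the paper is curriculum-based pretraining. The paper does not publish this exact schedule, but the general idea can be sketched as follows: train early stages on short, low-resolution clips (which are cheap), then progressively increase resolution and clip length. The stage boundaries and sizes below are hypothetical values for illustration only.

```python
# Illustrative sketch (not the paper's actual schedule): a curriculum-style
# pretraining plan that grows video resolution and clip length over stages.
# All numbers here are hypothetical.

STAGES = [
    # (num_steps, resolution, num_frames)
    (1000, 256, 16),   # stage 1: cheap, short low-res clips
    (2000, 512, 32),   # stage 2: medium resolution, longer clips
    (3000, 768, 64),   # stage 3: near-target resolution and length
]

def stage_for_step(step: int) -> tuple[int, int]:
    """Return (resolution, num_frames) for a given global training step."""
    budget = 0
    for num_steps, resolution, num_frames in STAGES:
        budget += num_steps
        if step < budget:
            return resolution, num_frames
    # Past the scheduled budget: stay at the final stage's settings.
    return STAGES[-1][1], STAGES[-1][2]
```

A data loader would query `stage_for_step` each step to decide how to crop and sample the next batch, so that early training spends its compute on many cheap examples before moving to expensive full-size videos.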

Why it matters?

This work is important because it pushes the boundaries of what's possible with AI-generated videos. Their model performs well compared to others, especially when creating videos for things like online shopping. More importantly, by sharing their code and model, they're helping other researchers build on their work and accelerate progress in this field, making it easier for anyone to create high-quality videos with AI.

Abstract

In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available at https://github.com/Shopee-MUG/MUG-V.