Training Video Foundation Models with NVIDIA NeMo

Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang

2025-03-18

Summary

This paper explores how to train powerful AI models for generating videos, called Video Foundation Models (VFMs), using NVIDIA's NeMo platform.

What's the problem?

Training high-quality VFMs that can create realistic videos is challenging due to the large amount of data required and the complexity of the models.

What's the solution?

The researchers built a scalable, open-source pipeline on NVIDIA NeMo that accelerates video dataset curation, multimodal data loading, and parallelized training and inference of video diffusion models. They also performed a performance analysis of the pipeline to identify best practices for training and serving VFMs efficiently.
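The summary does not show NeMo's actual training code, but the core objective a video diffusion model optimizes can be sketched briefly. The following is a minimal, hypothetical NumPy illustration (not NeMo's API) of the forward noising process: a clean video is mixed with Gaussian noise according to a schedule, and the denoiser network would then be trained to predict that noise from the noisy input. The schedule here is a simplified, unnormalized cosine schedule; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    # Cumulative signal-retention schedule (simplified, unnormalized
    # cosine schedule): close to 1 at t=0, decaying toward 0 at t=T.
    return np.cos((t / T + 0.008) / 1.008 * np.pi / 2) ** 2

def add_noise(x0, t, T, rng):
    # Forward diffusion step: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps.
    a_bar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps

rng = np.random.default_rng(0)
# A tiny stand-in "video": 4 frames of 8x8 RGB, values in [-1, 1].
video = rng.uniform(-1.0, 1.0, size=(4, 8, 8, 3))
xt, eps = add_noise(video, t=500, T=1000, rng=rng)

# In training, a denoiser network would take (xt, t) and be optimized to
# predict eps, typically with a mean-squared-error loss.
print(xt.shape)
```

At scale, the expensive parts are exactly what the pipeline targets: curating and loading the video/text data that feeds this step, and sharding the denoiser and its activations across many GPUs so each training step like the one above runs in parallel.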

Why it matters?

This work matters because it makes it easier and faster to train VFMs, which can be used to create realistic simulations for training AI systems and to develop new visual experiences.

Abstract

Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high-quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.