World Simulation with Video Foundation Models for Physical AI

NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng

2025-11-04

Summary

This paper introduces Cosmos-Predict2.5 and Cosmos-Transfer2.5, new AI models designed to better understand and simulate the physical world. They can create realistic videos from text, images, or other videos, and even translate between real and simulated environments.

What's the problem?

Creating AI that can reliably interact with the real world is difficult because it requires an understanding of physics and how objects behave. Previous models struggled to generate high-quality, realistic videos and often failed to follow instructions closely. In addition, transferring what an AI learns in simulation to the real world, or vice versa, remained a challenge.

What's the solution?

The researchers built Cosmos-Predict2.5, a single model that accepts different types of input (text, images, videos) and uses a vision-language model called Cosmos-Reason1 to better interpret text instructions. They trained it on a large corpus of curated video data and refined it with reinforcement learning. They also created Cosmos-Transfer2.5, a smaller, ControlNet-style model that excels at translating between simulated and real-world visuals. Both models are publicly available, along with the code and data used to train them.
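To make the "single model, multiple input types" idea concrete, here is a minimal sketch of how a unified Text2World / Image2World / Video2World interface might route requests. All names here are hypothetical illustrations, not the released Cosmos API; the actual models condition a flow-based generator on these inputs rather than dispatching to separate systems.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: WorldGenRequest and generation_mode are illustrative
# names invented for this example, not part of the Cosmos codebase.
@dataclass
class WorldGenRequest:
    """One request to a unified world model: whichever conditioning
    inputs are present determine the generation mode."""
    text: Optional[str] = None     # Text2World when only text is given
    image: Optional[bytes] = None  # Image2World when a frame is given
    video: Optional[bytes] = None  # Video2World when a clip is given

def generation_mode(req: WorldGenRequest) -> str:
    """Pick the mode a unified model would operate in.
    Richer conditioning (video > image > text) takes precedence."""
    if req.video is not None:
        return "Video2World"
    if req.image is not None:
        return "Image2World"
    if req.text is not None:
        return "Text2World"
    raise ValueError("request must carry at least one conditioning input")

# Example: a text-only prompt resolves to Text2World.
print(generation_mode(WorldGenRequest(text="a robot arm stacking blocks")))
```

The point of the unified design is that downstream users (e.g. robotics pipelines) hit one model regardless of whether they start from a prompt, a single frame, or a context video.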

Why it matters?

These models are important because they make it easier to create realistic simulations for training robots and autonomous systems. This means we can test and improve these systems in a safe, virtual environment before deploying them in the real world, ultimately leading to more reliable and capable AI that can interact with our physical surroundings.

Abstract

We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.