Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
NVIDIA, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye
2026-05-01
Summary
This paper introduces Nemotron 3 Nano Omni, a new artificial intelligence model that can understand and process multiple types of information: text, images, video, and now audio. It is an upgrade of a previous model, Nemotron Nano V2 VL, and performs better across all of these modalities.
What's the problem?
Existing AI models often struggle to handle multiple types of data seamlessly, such as understanding both what is being said in a video and what is happening on screen. Many powerful AI models are also slow and require substantial computing power, making them impractical for some uses. The goal was to create a model that handles all of these data types both effectively and efficiently.
What's the solution?
The researchers built Nemotron 3 Nano Omni on a strong base model and improved its architecture, its training data, and the training process itself. They also developed new token-reduction techniques that shrink the amount of data the model must process, making it faster and more efficient without sacrificing accuracy. They are releasing the model checkpoints and portions of the training data and code so others can build on their work.
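To make the token-reduction idea concrete, here is a generic illustration: average-pooling a grid of image-patch embeddings to cut the visual token count before it reaches the language model. This is a minimal sketch of the general technique only; the paper does not specify its method here, and the function name, grid shape, and pooling window below are illustrative assumptions.

```python
import numpy as np

def pool_visual_tokens(tokens: np.ndarray, window: int = 2) -> np.ndarray:
    """Average-pool an (H, W, D) grid of patch embeddings by `window`,
    cutting the token count by window**2. A generic token-reduction
    scheme, not the paper's actual method."""
    H, W, D = tokens.shape
    assert H % window == 0 and W % window == 0, "grid must divide evenly"
    return tokens.reshape(H // window, window,
                          W // window, window, D).mean(axis=(1, 3))

# 16x16 = 256 patch tokens of dimension 64 ...
grid = np.random.rand(16, 16, 64)
reduced = pool_visual_tokens(grid, window=2)
print(reduced.shape)  # (8, 8, 64): 64 tokens, a 4x reduction
```

Fewer tokens per image or video frame means less work for the language-model backbone at every layer, which is where the latency and throughput gains come from.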
Why it matters?
This model matters because it pushes the boundaries of what AI can do in real-world applications. It excels at tasks like understanding complex documents, making sense of long audio and video clips, and acting as an agent that operates a computer much like a human assistant would. Its efficiency also means the technology can be deployed in a wider range of situations.
Abstract
We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
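As a rough illustration of why the lower-precision checkpoint releases matter, here is back-of-the-envelope arithmetic for the weight storage of a 30B-parameter model at each released format. This counts weights only; activations, KV cache, and the scale/metadata overhead of real quantized checkpoints are ignored, and the A3B mixture-of-experts design means far fewer parameters are active per token than are stored.

```python
# Approximate weight-only memory footprint of a 30B-parameter model.
# Bytes per parameter: BF16 = 2, FP8 = 1, FP4 = 0.5 (illustrative;
# real quantized checkpoints carry extra scale/metadata overhead).
num_params = 30e9
bytes_per_param = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
footprints_gb = {fmt: num_params * b / 1e9 for fmt, b in bytes_per_param.items()}
for fmt, gb in footprints_gb.items():
    print(f"{fmt}: {gb:.0f} GB")  # BF16: 60 GB, FP8: 30 GB, FP4: 15 GB
```

Halving the bytes per parameter at each step is what lets the same model fit on progressively smaller (or fewer) GPUs.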