NVIDIA Nemotron Nano V2 VL
NVIDIA, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Guo Chen, Karan Sapra, Zhiding Yu, Adi Renduchintala, Charles Wang, Peter Jin, Arushi Goel, Mike Ranzinger, Lukas Voegtle, Philipp Fischer, Timo Roman, Wei Ping
2025-11-07
Summary
This paper introduces Nemotron Nano V2 VL, an improved vision-language model that excels at understanding both images and text, especially in long documents and videos.
What's the problem?
Existing AI models often struggle to understand complex information in long-form content such as lengthy documents or videos, and they can be slow to process that much data. The previous Nemotron vision-language model also left room for improvement across different types of visual and textual tasks.
What's the solution?
The researchers created Nemotron Nano V2 VL by making significant changes to the model's architecture, its training data, and its training recipe. The model combines two architectures, Mamba and Transformer, in a hybrid design and uses token reduction techniques that shrink the number of tokens it must process, which makes inference faster on long documents and videos. The researchers are also releasing the model checkpoints, much of the training data, and the training code, so others can build on their work.
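To give a concrete sense of what "token reduction" can mean for a vision-language model, here is a minimal sketch of a pixel-shuffle (space-to-depth) merge of neighboring vision tokens. This is a common technique in the field, not necessarily the exact method used in Nemotron Nano V2 VL, and all names and shapes below are illustrative assumptions.

```python
import torch

def pixel_shuffle_reduce(vision_tokens: torch.Tensor, grid: int, ratio: int = 2) -> torch.Tensor:
    """Merge each ratio x ratio block of vision tokens into a single token.

    vision_tokens: (batch, grid * grid, dim) tokens from a ViT-style encoder.
    Returns (batch, (grid // ratio) ** 2, dim * ratio * ratio), cutting the
    token count by ratio**2 before projection into the LLM embedding space.
    """
    b, n, d = vision_tokens.shape
    assert n == grid * grid and grid % ratio == 0
    # Restore the 2D patch grid, then fold each ratio x ratio block into channels.
    x = vision_tokens.view(b, grid, grid, d)
    x = x.view(b, grid // ratio, ratio, grid // ratio, ratio, d)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (grid // ratio) ** 2, d * ratio * ratio)

# Example: 1,024 tokens from a 32x32 patch grid become 256 wider tokens.
tokens = torch.randn(1, 32 * 32, 1024)
reduced = pixel_shuffle_reduce(tokens, grid=32, ratio=2)
print(reduced.shape)  # torch.Size([1, 256, 4096])
```

Reducing the visual token count this way lowers the number of tokens the language model attends over, which is where most of the speedup on long documents and videos would come from.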
Why it matters?
This new model is important because it pushes the boundaries of what AI can do with real-world information. It’s better at understanding complex documents and videos, and it does so more efficiently. This could lead to improvements in many areas, like automated document processing, video analysis, and more advanced AI assistants.
Abstract
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron vision-language series designed for strong real-world document understanding, long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and innovative token reduction techniques to achieve higher inference throughput in long document and video scenarios. We are releasing model checkpoints in BF16, FP8, and FP4 formats and sharing large parts of our datasets, recipes and training code.
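Since the abstract mentions checkpoint releases in BF16, FP8, and FP4 formats, the sketch below shows how a BF16 checkpoint might be loaded with Hugging Face Transformers. The repository id is a placeholder, and the released model may require a different model class or processor; treat this as an assumption-laden example rather than official usage.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Placeholder repository id; check NVIDIA's model page for the actual name.
MODEL_ID = "nvidia/Nemotron-Nano-V2-VL"

# Load the BF16 checkpoint; custom VLMs typically need trust_remote_code=True.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```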