
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal

2025-08-21


Summary

This paper introduces Nemotron-Nano-9B-v2, an AI language model that is much faster at generating long, step-by-step reasoning while matching or beating the accuracy of other models of similar size. It achieves this speed-up by mixing two different AI architectures, Mamba and Transformer, in a single model.

What's the problem?

Traditional Transformer-based language models become slow when they have to read very long inputs or generate long chains of step-by-step reasoning, because the memory and compute costs of self-attention keep growing as the sequence gets longer. This makes it expensive to handle very long documents or to produce the lengthy "thinking traces" that complex reasoning requires.
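
To make that cost concrete, here is a minimal Python sketch of how a Transformer's key-value cache grows with sequence length while a Mamba-style recurrent state stays fixed. The hyperparameters below are hypothetical and chosen only for illustration; they are not the paper's configuration.

```python
# Minimal sketch (illustrative numbers, not the paper's exact configuration):
# a Transformer's key-value cache grows linearly with every generated token,
# while a Mamba-style layer keeps a fixed-size recurrent state.

BYTES_BF16 = 2  # bytes per bfloat16 value

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len):
    """Memory for keys + values across all attention layers at a given length."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * BYTES_BF16

# Hypothetical 9B-class settings, for illustration only.
layers, kv_heads, head_dim = 40, 8, 128

for seq_len in (8_000, 24_000, 128_000):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> ~{gib:.1f} GiB of KV cache")

# A Mamba-2 layer's state size does not depend on seq_len, which is why
# replacing most attention layers keeps long generations fast and memory-lean.
```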

What's the solution?

The researchers built Nemotron-Nano-9B-v2 on a hybrid architecture that replaces most of a Transformer's self-attention layers with Mamba-2 layers, which are much faster at generating long sequences. They first pre-trained a larger 12-billion-parameter base model on 20 trillion tokens, aligned it, and then used a compression-and-distillation method called Minitron to shrink it to 9 billion parameters, small enough to run inference over inputs of up to 128,000 tokens on a single, more accessible NVIDIA A10G GPU.
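
As a rough illustration of the hybrid design, the sketch below builds a layer stack in which most positions are Mamba-2 blocks and only a few are self-attention blocks. The layer count, attention ratio, and placement here are assumptions for illustration, not the published Nemotron-Nano-9B-v2 configuration.

```python
# Minimal sketch of a hybrid Mamba-Transformer layer stack.
# The counts and placement below are assumptions for illustration only,
# not the actual Nemotron-Nano-9B-v2 layer pattern.

from dataclasses import dataclass

@dataclass
class LayerSpec:
    kind: str    # "mamba2" or "attention"
    index: int

def build_hybrid_stack(num_layers: int = 56, attention_every: int = 10) -> list[LayerSpec]:
    """Mostly Mamba-2 layers, with an occasional self-attention layer kept
    so the model retains precise long-range token-to-token lookups."""
    stack = []
    for i in range(num_layers):
        kind = "attention" if (i + 1) % attention_every == 0 else "mamba2"
        stack.append(LayerSpec(kind=kind, index=i))
    return stack

stack = build_hybrid_stack()
print(sum(s.kind == "mamba2" for s in stack), "Mamba-2 layers,",
      sum(s.kind == "attention" for s in stack), "attention layers")
```

The design intuition is that the few remaining attention layers handle exact retrieval over the context, while the cheap, constant-state Mamba-2 layers carry most of the sequence modeling.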

Why it matters?

This development is important because it offers a way to make AI models significantly faster at tasks requiring complex reasoning, like solving multi-step problems or understanding very long texts, without sacrificing accuracy. This means AI could be used more effectively in applications that need to process and reason about large amounts of information quickly, and it makes these powerful capabilities more accessible by requiring less hardware.

Abstract

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
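
Since the checkpoints are released on Hugging Face, a minimal sketch of loading one with the transformers library might look like the following. The repository ID and generation settings are assumptions; consult the official model card for the exact ID and any required loading options (for example, whether trust_remote_code or a specific chat template is needed).

```python
# Minimal sketch: loading a released checkpoint from Hugging Face.
# The repository ID below is a placeholder assumption; check the official
# model card for the exact ID and recommended usage.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-Nano-9B-v2"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bfloat16, matching the paper's A10G setting
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```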