Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA, Aakshita Chandiramani, Aaron Blakeman, Abdullahi Olaoye, Abhibha Gupta, Abhilash Somasamudramath, Abhinav Khattar, Adeola Adesoba, Adi Renduchintala, Adil Asif, Aditya Agrawal, Aditya Vavre, Ahmad Kiswani, Aishwarya Padmakumar, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Gronskiy, Alex Kondratenko, Alex Neefus

2026-04-15

Summary

This paper describes how NVIDIA built Nemotron 3 Super, a 120-billion-parameter language model that activates only about 12 billion parameters per token, and how it was trained and quantized so it can run efficiently.

What's the problem?

Building powerful language models like those used for chatbots and writing assistance requires a lot of computing power and memory. Existing models, while good, can be slow and expensive to run, especially when dealing with long pieces of text. The challenge is to create a model that's both accurate *and* fast, without needing massive hardware.

What's the solution?

The researchers built Nemotron 3 Super, a 120 billion parameter model that uses only about 12 billion parameters for any given token, using a few key innovations. First, they pre-trained it with a compact 4-bit number format (NVFP4) to represent the model's data, which saves memory and compute. Second, they used LatentMoE, a new Mixture-of-Experts design in which each token is routed to only a few expert sub-networks, chosen to get the most accuracy out of both compute and parameter count. Finally, they added multi-token prediction (MTP) layers that let the model draft several upcoming tokens at once and then verify them (speculative decoding), which speeds up text generation. They trained this model on a huge amount of text (25 trillion tokens) and then refined it with supervised fine-tuning and reinforcement learning.
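To make the space savings concrete, here is a minimal NumPy sketch of block-scaled 4-bit quantization in the spirit of NVFP4. The block size, the 4-bit value grid, and the absolute-max scaling rule are illustrative assumptions rather than details taken from the paper; real NVFP4 kernels pack the codes into 4 bits and store per-block scales in a compact format, but the idea is the same: each group of weights shares one scale, and each weight is snapped to a tiny set of representable values.

```python
import numpy as np

# Illustrative 4-bit magnitude grid (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blockwise_fp4(weights, block_size=16):
    """Quantize a 1-D float array (length divisible by block_size) to
    4-bit-style codes plus one floating-point scale per block."""
    w = weights.reshape(-1, block_size)
    # One scale per block: the largest magnitude in the block maps to the
    # largest representable 4-bit value.
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0                      # avoid dividing by zero
    scaled = w / scales
    # Snap each scaled magnitude to the nearest grid value, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * FP4_GRID[idx]        # stored as 4-bit codes in practice
    return codes, scales

def dequantize(codes, scales):
    return (codes * scales).reshape(-1)

weights = np.random.randn(64).astype(np.float32)
codes, scales = quantize_blockwise_fp4(weights)
print("max abs error:", np.abs(weights - dequantize(codes, scales)).max())
```

Storing a 4-bit code per weight instead of a 16-bit number is roughly a 4x reduction in weight memory, which is why low-precision formats like this matter at the 120-billion-parameter scale.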

Why it matters?

Nemotron 3 Super is significant because it achieves accuracy comparable to other top models on common benchmarks while running much faster: up to 2.2 times higher inference throughput than GPT-OSS-120B and up to 7.5 times higher than Qwen3.5-122B. It also supports context windows of up to 1 million tokens, so it can handle very long conversations or documents, and its efficiency could make advanced AI more accessible by requiring less expensive hardware. The researchers also released the training datasets and the base, post-trained, and quantized model checkpoints publicly, allowing others to build on their work.

Abstract

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.
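For readers curious how a Mixture-of-Experts model can have 120 billion parameters in total but only about 12 billion "active" per token, the toy NumPy sketch below shows generic top-k expert routing: a small router scores every expert, but each token is processed by only its top-k experts. This is a simplified, made-up illustration with arbitrary expert counts and dimensions; it is not the LatentMoE architecture described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, top_k = 8, 16, 4, 2   # illustrative sizes only

# Router and per-expert feed-forward weights (random, for illustration).
router_w = rng.standard_normal((d_model, n_experts))
experts = [
    (rng.standard_normal((d_model, d_hidden)), rng.standard_normal((d_hidden, d_model)))
    for _ in range(n_experts)
]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens):
    """tokens: (n_tokens, d_model) -> (n_tokens, d_model)"""
    logits = tokens @ router_w                      # router scores every expert
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # but only the top-k are used
    out = np.zeros_like(tokens)
    for t, x in enumerate(tokens):
        gate = softmax(logits[t, top[t]])           # renormalize over chosen experts
        for g, e in zip(gate, top[t]):
            w_in, w_out = experts[e]
            # Only these k experts' weights are touched for this token.
            out[t] += g * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out

tokens = rng.standard_normal((3, d_model))
print(moe_layer(tokens).shape)  # (3, 8)
```

Because only k of the experts run for each token, the compute per token scales with the active parameters rather than the total parameter count, which is the efficiency property the abstract's "120 billion (active 12 billion)" figure refers to.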