Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
2025-11-21
Summary
This paper introduces Nemotron Elastic, a new way to create different sizes of large language models (LLMs) without having to train each one completely from scratch. It's like building a family of models within a single, larger model.
What's the problem?
Training multiple LLMs, each with a different size and capability, is incredibly expensive and time-consuming. Existing methods to shrink models after they're trained, like pruning or knowledge distillation, still require a lot of computing power and data. Basically, it costs a fortune to have a range of models for different needs.
What's the solution?
Nemotron Elastic solves this by creating a 'parent' model that contains several smaller 'submodels' inside it. These submodels share the parent model's weights but are optimized for different situations, like running on devices with less memory or needing faster responses. The key is that you can 'extract' these submodels without any extra training – they're ready to go immediately. The authors use a learned 'router' together with a carefully planned two-stage training process to make this work, and they also adapt how the model's internal components (its Mamba, attention, and MLP layers) are shrunk so that reasoning performance is maintained.
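To make the nested-submodel idea concrete, here is a minimal, hypothetical sketch of weight sharing by slicing: the smaller model reuses a leading slice of the parent's weight matrix, so 'extracting' it requires no extra parameters or retraining. The class name ElasticLinear, the sizes, and the slicing scheme are illustrative assumptions, not the paper's actual implementation, which elastifies Mamba/SSM groups, MLPs, and depth via a trained router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional

class ElasticLinear(nn.Module):
    """Hypothetical linear layer whose leading rows double as a smaller nested submodel."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor, width: Optional[int] = None) -> torch.Tensor:
        # The nested submodel simply uses the first `width` output channels of the
        # shared weights, so "extraction" needs no extra parameters or retraining.
        w = self.weight if width is None else self.weight[:width]
        b = self.bias if width is None else self.bias[:width]
        return F.linear(x, w, b)

# Parent path uses all 4096 channels; a smaller nested budget uses 2048 (toy sizes).
layer = ElasticLinear(in_features=4096, out_features=4096)
x = torch.randn(1, 4096)
full_out = layer(x)               # parent model path
small_out = layer(x, width=2048)  # zero-shot nested submodel path
print(full_out.shape, small_out.shape)
```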
Why it matters?
This research is important because it dramatically reduces the cost of developing and deploying a family of LLMs. The authors show that the nested smaller models perform on par with or better than the best existing models, while cutting training cost by over 360 times compared to starting from scratch (and roughly 7 times compared to state-of-the-art compression methods). This makes advanced AI more accessible and allows for more flexible deployment options, meaning you can have a powerful AI that adapts to different devices and budgets without sacrificing quality.
Abstract
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par with or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach yields a many-in-one reasoning model whose deployment memory remains constant regardless of the number of models in the family.
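As one concrete illustration of the 'normalized MSE-based layer importance' mentioned in the abstract, the sketch below scores each layer by its input-output MSE normalized by activation magnitude, then keeps the highest-scoring layers at a given depth budget. The exact formulation here (which activations serve as the reference and how normalization is applied) is an assumption made for illustration; the paper's actual criterion may differ.

```python
import torch

def normalized_mse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # MSE between two activation tensors, normalized by the mean square of the
    # reference `b` so scores are comparable across layers with different scales.
    return ((a - b) ** 2).mean() / ((b ** 2).mean() + 1e-8)

def layer_importance(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    # A layer whose output barely differs from its input (after normalization)
    # contributes little and is a candidate for removal at smaller depth budgets.
    return normalized_mse(hidden_out, hidden_in).item()

# Toy example: random "activations" around a 4-layer stack (5 boundary tensors).
hiddens = [torch.randn(2, 8, 16) for _ in range(5)]
scores = [layer_importance(hiddens[i], hiddens[i + 1]) for i in range(4)]

# Keep the 3 most important layers for a hypothetical shallower submodel.
keep = sorted(range(4), key=lambda i: scores[i], reverse=True)[:3]
print(scores, sorted(keep))
```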