Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models

Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

2025-12-01

Summary

This paper focuses on making small language models (SLMs) run faster on actual devices, which is crucial for applications where quick responses are needed, like on phones or in real-time systems.

What's the problem?

Simply shrinking an SLM's parameter count doesn't automatically make it faster on real hardware. Previous research focused on building parameter-efficient models, but didn't fully consider how different design choices affect speed on actual devices. The core issue is understanding what specifically causes delays when these models run, and how to design them to minimize those delays.
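A toy cost model makes the parameter-vs.-latency gap concrete. The sketch below (all constants invented for illustration, not measured) compares a deep-thin and a shallow-wide model with the same parameter count: each layer adds fixed sequential overhead (kernel launches, memory round-trips), while extra width is largely parallelized, so the deeper model ends up slower at small batch sizes.

```python
# Toy latency model: why deep-thin models with the same parameter count can
# be slower at small batch sizes. All constants are invented for illustration.

def params(depth, width):
    # Rough transformer-block parameter count: ~12 * width^2 per layer.
    return depth * 12 * width**2

def small_batch_latency_ms(depth, width, overhead_ms=0.05, flops_per_ms=5e8):
    # Latency = sequential per-layer overhead + raw compute time
    # (~2 FLOPs per parameter per token at batch size 1).
    compute = params(depth, width) * 2 / flops_per_ms
    return depth * overhead_ms + compute

deep_thin = (48, 768)      # (depth, width)
shallow_wide = (12, 1536)  # same parameter budget, a quarter of the depth

for name, (d, w) in [("deep-thin", deep_thin), ("shallow-wide", shallow_wide)]:
    print(f"{name}: params={params(d, w) / 1e6:.0f}M, "
          f"latency={small_batch_latency_ms(d, w):.2f} ms")
```

Both configurations have identical compute time in this model; the deep-thin one pays four times the per-layer overhead, which is exactly the kind of effect parameter counts alone cannot capture.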

What's the solution?

The researchers investigated two main architectural factors: the model's depth-width ratio and its choice of building operators. They found that very deep but narrow models, while often more accurate for a given parameter budget, aren't always on the accuracy-latency frontier. They also evaluated emerging efficient alternatives to standard attention as candidate operators. Using the promising ones, they built an evolutionary search framework that automatically tries different combinations of these operators to discover the fastest hybrid model designs. Finally, they improved the training process with a weight normalization technique that enables more effective weight updates and better convergence. The result is a new family of models called Nemotron-Flash.
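The operator-search step can be sketched as a simple evolutionary loop. Everything below is hypothetical: the operator names, the toy latency/quality tables standing in for on-device profiling and trained-model accuracy, and the fitness weighting are all invented for illustration, not the paper's actual candidate set or search objective.

```python
import random

# Hypothetical operator vocabulary for a hybrid SLM (names illustrative).
OPERATORS = ["attention", "linear_attention", "sliding_window", "mamba"]

# Toy per-layer cost/quality tables standing in for measured latency (ms)
# and an accuracy proxy; a real search would profile hardware and train.
LATENCY = {"attention": 3.0, "linear_attention": 1.2,
           "sliding_window": 1.5, "mamba": 1.0}
QUALITY = {"attention": 1.0, "linear_attention": 0.7,
           "sliding_window": 0.8, "mamba": 0.85}

def fitness(arch):
    # Reward the accuracy proxy, penalize total latency.
    return sum(QUALITY[op] for op in arch) - 0.2 * sum(LATENCY[op] for op in arch)

def mutate(arch):
    # Swap one layer's operator for a random alternative.
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPERATORS)
    return child

def evolve(num_layers=8, population=20, generations=30, seed=0):
    random.seed(seed)
    pop = [[random.choice(OPERATORS) for _ in range(num_layers)]
           for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: population // 2]           # keep the fittest half
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in survivors]       # refill with mutants
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

Under this toy objective the search drifts toward cheaper operators while keeping enough high-quality layers, which mirrors the kind of accuracy-latency trade-off the paper's framework explores at real scale.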

Why it matters?

This work is important because it provides a way to build SLMs that are both accurate *and* fast. The Nemotron-Flash models achieve over +5.5% higher average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, meaning they can serve latency-critical applications while delivering better performance overall.

Abstract

Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
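The abstract mentions a weight normalization technique but does not spell out its form. As a point of reference, here is a minimal sketch of the classic weight-normalization reparameterization (w = g * v / ||v||), which decouples a weight vector's magnitude from its direction; the paper's actual technique may differ.

```python
import numpy as np

# Classic weight normalization sketch: each output row's weights are
# reparameterized as w = g * v / ||v||, separating magnitude g from
# direction v. Illustrative only; not necessarily the paper's variant.
rng = np.random.default_rng(0)

d_in, d_out = 16, 4
v = rng.standard_normal((d_out, d_in))  # direction parameters
g = np.ones(d_out)                      # per-row magnitude parameters

def normalized_weights(v, g):
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return g[:, None] * v / norms

W = normalized_weights(v, g)
x = rng.standard_normal(d_in)
y = W @ x

# Each row of W has norm exactly g no matter how v drifts during training,
# which keeps the effective scale of weight updates well conditioned.
print(np.linalg.norm(W, axis=1))
```

Keeping the effective weight norm fixed in this way is one standard route to the "more effective weight updates and improved final convergence" the abstract describes.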