Beyond Outliers: A Study of Optimizers Under Quantization
Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh
2025-10-10
Summary
This paper investigates how the choice of optimization algorithm (the method used to train a neural network) affects how well the network performs after being compressed with a technique called quantization. Quantization stores weights in fewer bits, making models smaller and faster, but it can reduce accuracy.
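To make the idea concrete, here is a minimal sketch of symmetric round-to-nearest weight quantization in PyTorch. The function name, the per-tensor scale, and the bit-widths are illustrative assumptions, not the paper's exact setup.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric round-to-nearest quantization of a weight tensor (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for signed 8-bit integers
    scale = w.abs().max() / qmax         # one scale for the whole tensor (per-tensor quantization)
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)  # snap weights to the integer grid
    return q * scale                     # dequantize back to float so we can compare to the original

w = torch.randn(256, 256)
w_q = quantize_rtn(w, bits=4)
print("mean squared quantization error:", torch.mean((w - w_q) ** 2).item())
```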
What's the problem?
Currently, there is little systematic evidence on how different optimizers interact with quantization. While both optimization and quantization are individually well studied, it is unclear whether an optimizer that works well for a full-precision model remains the best choice *after* the model is quantized. Existing metrics for predicting how well a quantized model will perform do not transfer reliably across optimizers, because they measure the error of each layer in isolation rather than how quantization errors accumulate and propagate through the network; two such per-layer metrics are sketched below.
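As a rough illustration of the per-layer outlier statistics the paper examines, the sketch below computes a max-to-mean ratio (MMR) and a kurtosis for each weight tensor. The exact definitions used in the paper may differ, so treat these formulas as assumptions.

```python
import torch

def outlier_stats(w: torch.Tensor) -> tuple[float, float]:
    """Per-tensor outlier metrics: max-to-mean ratio (MMR) and kurtosis (illustrative definitions)."""
    a = w.abs().flatten()
    mmr = (a.max() / a.mean()).item()        # how far the largest weight sits above the average magnitude
    z = (w.flatten() - w.mean()) / w.std()
    kurtosis = (z ** 4).mean().item()        # 4th standardized moment; equals 3 for a Gaussian
    return mmr, kurtosis

# Both metrics describe a single layer in isolation; they say nothing about how
# one layer's quantization error is amplified by the layers that follow it,
# which is why they can fail to rank optimizers by PTQ robustness.
layers = {
    "well_behaved": torch.randn(512, 512),        # roughly Gaussian weights
    "heavy_tailed": torch.randn(512, 512).pow(3), # heavier tails -> more outliers
}
for name, w in layers.items():
    print(name, outlier_stats(w))
```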
What's the solution?
The researchers trained several neural networks of varying sizes (50M to 1.5B parameters) with six different optimizers. They then measured how these networks performed under post-training quantization (PTQ), which compresses an already-trained model without further training, and under quantization-aware training (QAT), which simulates quantization during training so the model can adapt to it (see the sketch below). They found that some optimizers, notably Shampoo, consistently led to the smallest accuracy loss under QAT and achieved the best parameter efficiency, reaching a given level of performance with fewer parameters. They also showed why common outlier metrics, such as the max-to-mean ratio and kurtosis, are unreliable predictors of quantization performance across optimizers.
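For readers unfamiliar with the two regimes: PTQ quantizes a finished model in one shot (as in the first sketch above), while QAT inserts a "fake-quantization" step into the forward pass so the optimizer can adapt the weights to it. The straight-through-estimator sketch below is one common way to do this; it is a sketch of the general technique, not the paper's exact recipe.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Fake quantization with a straight-through estimator (STE):
    forward rounds to the integer grid, backward passes gradients through unchanged."""

    @staticmethod
    def forward(ctx, w, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # STE: ignore the non-differentiable rounding step

class QATLinear(torch.nn.Linear):
    """Linear layer that sees quantized weights in the forward pass during training."""

    def forward(self, x):
        return torch.nn.functional.linear(x, FakeQuant.apply(self.weight, 4), self.bias)

# The optimizer (AdamW here; the paper compares six, including Shampoo) updates the
# full-precision weights, but the loss is computed through their quantized versions.
layer = QATLinear(128, 128)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)
loss = layer(torch.randn(32, 128)).pow(2).mean()
loss.backward()
opt.step()
```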
Why it matters?
This research is important because it provides guidance on selecting the best optimizer when deploying models on devices with limited resources, like phones or embedded systems. Choosing the right optimizer can significantly improve the accuracy and efficiency of quantized models, making them more practical for real-world applications. It also highlights the need to consider optimizer-quantization interactions, rather than treating them as separate problems.
Abstract
As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer-quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ) and quantization-aware training (QAT). We first train full-precision models, ranging from 50M to 1.5B parameters, with six optimizers to explore the hyperparameter landscape and establish well-tuned baselines. We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and Kurtosis, fail to predict the PTQ performance across different optimizers. We show analytically that this is due to the MMR capturing only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study the QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers performing well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.
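The abstract's final claim concerns scaling laws, i.e., fitting how QAT loss falls as parameter count grows for each optimizer. The sketch below shows a minimal fit of an assumed power-law form; the data points are hypothetical placeholders for illustration, not the paper's measurements.

```python
import numpy as np

# Hypothetical (parameter count, final QAT loss) pairs for one optimizer; the paper
# fits such curves per optimizer and compares their parameter efficiency.
sizes = np.array([50e6, 100e6, 300e6, 700e6, 1.5e9])
losses = np.array([3.9, 3.6, 3.2, 3.0, 2.8])

# Assume L(N) ~ a * N**(-b); in log-log space this is a straight line,
# so an ordinary least-squares fit recovers the exponent b.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
b, a = -slope, np.exp(intercept)
print(f"fitted exponent b = {b:.3f}  (a larger b means more loss reduction per added parameter)")
```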