ROOT: Robust Orthogonalized Optimizer for Neural Network Training
Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, Yunhe Wang
2025-11-26
Summary
This paper introduces ROOT, a new optimizer for training large language models, with a focus on making the training process more stable and reliable.
What's the problem?
Training these massive models is tricky because even tiny numerical errors can throw everything off, leading to slow or failed training. Recent optimizers converge faster, but their precision depends on the shapes of the model's weight matrices, and they can be derailed by outliers, meaning unusually large or noisy gradient values. Basically, they're fragile and don't always hold up in real-world training.
What's the solution?
The researchers developed an optimizer called ROOT (Robust Orthogonalized Optimizer). It tackles the problem in two main ways. First, it adapts its orthogonalization step to the size of each weight matrix, using Newton iterations with coefficients tailored to the matrix's dimensions, so precision stays consistent across different parts of the model. Second, it uses a proximal-optimization technique to suppress outlier noise in the updates while still preserving the gradient directions that actually matter. A minimal sketch of the first idea follows below.
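To make the first mechanism concrete, here is a small PyTorch sketch of what a dimension-aware orthogonalization step can look like. The quintic Newton-Schulz iteration shown is the standard one used by Muon-style optimizers; the shape_aware_coeffs lookup, its fallback values, and the function names are illustrative placeholders, not the coefficients or code actually used by ROOT.

```python
import torch

# Generic quintic Newton-Schulz coefficients (as popularized by Muon).
# ROOT reportedly tailors such coefficients to the matrix dimensions;
# the paper's actual values are not reproduced here.
_DEFAULT_COEFFS = (3.4445, -4.7750, 2.0315)

def shape_aware_coeffs(rows: int, cols: int) -> tuple[float, float, float]:
    # Placeholder lookup: a real implementation would return coefficients
    # tuned to (rows, cols); this sketch always falls back to the generic triple.
    return _DEFAULT_COEFFS

def orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize a 2-D momentum/update matrix with
    # Newton-Schulz iterations, the core operation in Muon-style optimizers.
    a, b, c = shape_aware_coeffs(*m.shape)
    x = m / (m.norm() + 1e-7)             # scale so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                         # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

The lookup table is where a dimension-robust scheme would differ from a one-size-fits-all iteration: the same number of steps and the same coefficients do not give the same orthogonalization accuracy for, say, a tall embedding matrix and a nearly square attention projection.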
Why it matters?
This work matters because it offers a more dependable way to train the increasingly large and complex AI models now in wide use. A more robust training process means these models can be developed faster, perform better, and behave more reliably across applications, especially when the training signal is noisy or imperfect.
Abstract
The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/ROOT.
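To illustrate the second mechanism, here is a small, hypothetical sketch of outlier suppression via a proximal operator, implemented as soft-thresholding of the deviation between the incoming gradient and the momentum buffer. The abstract only states that ROOT uses proximal optimization to suppress outlier noise; the specific operator, the tau threshold, and the function names below are assumptions made for illustration, not the paper's actual formulation.

```python
import torch

def soft_threshold(x: torch.Tensor, tau: float) -> torch.Tensor:
    # Proximal operator of tau * ||.||_1: small entries go to zero,
    # large entries are shrunk toward zero by tau.
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

def robust_momentum_update(momentum: torch.Tensor,
                           grad: torch.Tensor,
                           beta: float = 0.95,
                           tau: float = 1e-3) -> torch.Tensor:
    # Illustrative robust momentum step: entries of the new gradient that
    # deviate from the momentum buffer by more than tau are treated as
    # outliers and damped, so isolated spikes cannot dominate the update.
    residual = grad - momentum                # deviation from the running estimate
    outliers = soft_threshold(residual, tau)  # only deviations beyond tau survive
    cleaned_grad = grad - outliers            # clamps the deviation to [-tau, tau]
    return beta * momentum + (1.0 - beta) * cleaned_grad
```

The effect is that gradient entries close to the running momentum pass through unchanged, while entries that spike far away are pulled back toward it before being averaged in, which is one concrete way a proximal step can trade off noise suppression against preserving the dominant update direction.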