Cautious Weight Decay
Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu
2025-10-15
Summary
This paper introduces Cautious Weight Decay (CWD), a simple one-line change to how weight decay is applied when training machine learning models, which consistently improves final performance.
What's the problem?
When training large neural networks, a technique called 'weight decay' is used to prevent the model from memorizing the training data and to help it generalize to new data. However, standard (decoupled) weight decay implicitly changes the objective being optimized: it adds a regularization term or constraint that is not part of the original loss. As training approaches a good solution, this extra pull toward smaller weights can drag the model away from the best solutions of the loss we actually care about.
What's the solution?
Cautious Weight Decay applies weight decay only to the parameters whose optimizer update is aligned with the decay, that is, only where the optimizer is already moving that parameter toward zero. Think of it like this: if the optimizer is trying to shrink a weight toward zero, decay is allowed to help; if the optimizer is trying to push that weight away from zero, decay is switched off for that coordinate. This lets the model keep searching for better solutions even when it is already performing well, and it does not change the core goal of the training process, only how it gets there.
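The per-coordinate rule can be sketched in a few lines. This is an illustration, not the paper's reference code: the function name `cwd_step` is ours, and the mask convention (decay applied where the sign of the update equals the sign of the parameter) follows the abstract's description.

```python
import numpy as np

def cwd_step(params, update, lr=1e-3, wd=0.1):
    """One illustrative Cautious Weight Decay step (names are hypothetical).

    `update` is the raw optimizer step direction, i.e. the quantity that
    will be subtracted from `params` (e.g. the Adam update before decay).
    Decay is applied only where update and parameter share a sign, which
    is exactly where the optimizer is already shrinking that coordinate
    toward zero.
    """
    mask = (np.sign(update) == np.sign(params)).astype(params.dtype)
    return params - lr * (update + wd * mask * params)
```

For a parameter vector `[0.5, -0.5, 0.5]` and update `[0.1, 0.1, -0.1]`, only the first coordinate gets decayed: it is the only one where the update already points toward zero.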
Why does it matter?
This method is a drop-in change for existing optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or extra tuning. More importantly, it consistently leads to better results, lower final loss and higher accuracy, when training very large models for tasks like language modeling and image recognition. This is significant because these large models are becoming increasingly common and important.
Abstract
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
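To illustrate the "drop-in" claim, here is one way the mask could be spliced into a plain AdamW-style step. This is a sketch under the assumption that the mask multiplies only the decoupled decay term; the function name `adamw_cwd_step` and the variable names are ours, not the paper's reference implementation.

```python
import numpy as np

def adamw_cwd_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
                   eps=1e-8, wd=0.1):
    """One AdamW step with Cautious Weight Decay (illustrative sketch).

    Identical to decoupled AdamW except that the decay term is
    multiplied by an elementwise mask that is 1 only where the raw
    Adam update and the parameter share a sign.
    """
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    u = m_hat / (np.sqrt(v_hat) + eps)   # raw Adam update direction
    mask = (np.sign(u) == np.sign(p)).astype(p.dtype)
    p = p - lr * (u + wd * mask * p)     # decay only on aligned coords
    return p, m, v
```

Removing `mask` (setting it to all ones) recovers standard decoupled AdamW, which is why no retuning of the other hyperparameters is needed.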