DeMo: Decoupled Momentum Optimization

Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma

2024-12-02

Summary

This paper introduces DeMo, a new optimization method that improves the training of large neural networks by reducing the need for fast communication between computers.

What's the problem?

Training large neural networks usually requires quickly sharing information (called gradients) between many computers (or accelerators). This is a problem because it depends on specialized high-speed connections, which are expensive and complicated to set up. When the links between computers are slow or limited, communication becomes a bottleneck that holds back the entire training process.

What's the solution?

DeMo solves this problem by letting each computer keep its own copy of the optimizer state (the extra bookkeeping, such as momentum, that optimizers use to guide updates). Instead of keeping everything perfectly synchronized, DeMo allows these states to diverge in a controlled way while still improving overall performance. Because only a small, fast-changing part of the momentum is shared, far less data needs to travel between computers, making it possible to train large models even when network bandwidth is low. The method works regardless of network topology or model architecture, and it has been shown to match or exceed existing optimizers like AdamW without needing high-speed interconnects.
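The core idea can be sketched in a few lines of numpy. This is a toy simulation, not the paper's implementation: the function name `demo_style_step` is made up for illustration, the "all-reduce" is simulated by averaging in-process, and plain top-k magnitude selection stands in for the paper's DCT-based frequency decomposition. What it does show is the decoupling: each worker folds gradients into a local momentum buffer, shares only its few fastest-moving components, and keeps the residual locally, so the buffers are allowed to diverge.

```python
import numpy as np

def demo_style_step(momenta, grads, lr=0.1, beta=0.9, k=4):
    """One simulated decoupled-momentum step across several workers.

    Hypothetical sketch: top-k magnitude selection stands in for the
    paper's DCT-based extraction of fast-moving momentum components.
    """
    shared = np.zeros_like(grads[0])
    for m, g in zip(momenta, grads):
        m *= beta
        m += g                              # local momentum update
        idx = np.argsort(np.abs(m))[-k:]    # the k fastest-moving entries
        extracted = np.zeros_like(m)
        extracted[idx] = m[idx]
        m[idx] = 0.0                        # residual momentum stays local
        shared += extracted / len(momenta)  # stand-in for a sparse all-reduce
    return -lr * shared                     # the only synchronized quantity

rng = np.random.default_rng(0)
momenta = [np.zeros(32) for _ in range(4)]
grads = [rng.normal(size=32) for _ in range(4)]
update = demo_style_step(momenta, grads)
# each worker communicates at most k entries instead of the full vector
```

Note the communication saving: with 4 workers each sharing k=4 entries, at most 16 of the 32 coordinates are ever exchanged per step, while the rest of the momentum lives only on its own worker.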

Why it matters?

This research is important because it makes training large neural networks more accessible and efficient, especially in environments where high-speed connections are not available. By reducing communication requirements, DeMo allows more people and organizations to train advanced AI models, which can lead to better applications in fields like natural language processing, computer vision, and more.

Abstract

Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce {De}coupled {Mo}mentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent and supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large scale foundation models. An open source reference PyTorch implementation is published on GitHub at https://github.com/bloc97/DeMo.
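The "energy compaction" principle the abstract invokes can be illustrated with a small numpy experiment (the helper `dct_matrix` below is a naive orthonormal DCT-II construction written for this sketch, not code from the DeMo repository): for smooth, slowly-varying signals, most of the energy lands in a small number of transform coefficients, which is what makes transmitting only a few components viable.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis, built naively for illustration."""
    k = np.arange(n)[:, None]          # frequency index
    x = np.arange(n)[None, :]          # sample index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * x + 1) / (2 * n))
    basis[0] /= np.sqrt(2.0)           # make the DC row unit-norm
    return basis

n = 64
walk = np.cumsum(np.random.default_rng(0).normal(size=n))  # smooth-ish signal
coeffs = dct_matrix(n) @ walk
sorted_energy = np.sort(coeffs**2)[::-1]
fraction = np.cumsum(sorted_energy) / sorted_energy.sum()
# for smooth signals, a small fraction of coefficients holds most energy
```

The `fraction` array answers "how much signal energy do the top-k coefficients capture?"; for a smooth signal like this random walk, the bulk of the energy sits in the first few coefficients, so discarding the rest loses little information.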