Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham
2025-01-31

Summary
This paper introduces a new method called Streaming DiLoCo that makes it cheaper and easier to train large AI language models using multiple computers working together. It's like finding a smarter way for a group of students to collaborate on a massive project without needing to be in the same room or constantly talking to each other.
What's the problem?
Training big AI models is like solving a huge puzzle, and it takes a lot of computers working together to do it quickly. Usually, these computers need to be close to each other and connected with super-fast internet to share information constantly. This is expensive and limits where you can set up your AI training. Even with newer methods that let computers be further apart, they still need to send huge amounts of data all at once, which can slow things down.
What's the solution?
The researchers improved an existing method called DiLoCo in three clever ways. First, instead of sharing all the information at once, they share small bits at a time, like passing notes in class instead of shouting across the room. Second, they let the computers keep working while sharing information, like talking and walking at the same time. Third, they found a way to compress the shared information so it takes up less space, like using abbreviations in text messages. Together, these changes make the whole process much more efficient.
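To put a rough number on the first idea (sharing small pieces instead of everything at once), here is a tiny back-of-the-envelope calculation in Python. The model size, bytes per value, and fragment count are made-up illustration numbers, not figures from the paper.

```python
# Back-of-the-envelope look at "sharing small bits at a time".
# All numbers below are made up for illustration, not taken from the paper.
PARAMS = 1_000_000_000      # a model with one billion parameters
BYTES_PER_VALUE = 2         # e.g. 16-bit values per parameter (illustrative)
NUM_FRAGMENTS = 100         # how many pieces the parameters are split into

full_sync_gb = PARAMS * BYTES_PER_VALUE / 1e9
per_fragment_gb = full_sync_gb / NUM_FRAGMENTS

print(f"all at once: {full_sync_gb:.1f} GB in a single burst")
print(f"streaming  : {per_fragment_gb:.2f} GB per burst, spread over {NUM_FRAGMENTS} bursts")
```

The total amount of data shared is the same; what changes is how much has to move at any one moment, which is what determines how fast the network link needs to be.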
Why it matters?
This matters because it could make training big AI models much cheaper and more accessible. It's like finding a way to build skyscrapers with regular construction equipment instead of needing special cranes. This could lead to more people being able to create powerful AI models, potentially speeding up AI research and development. It also means AI training could be done using computers spread out across the world, rather than needing them all in one place. This could lead to more diverse and collaborative AI projects in the future.
Abstract
Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at every gradient step, all devices need to be co-located using low-latency, high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers occur only infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, since each synchronization requires all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall-clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of models with billions of parameters and reach quality similar to before, while reducing the required bandwidth by two orders of magnitude.
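To make the abstract's three modifications concrete, below is a minimal, self-contained sketch in plain Python/NumPy of a streaming, quantized outer-synchronization loop. This is not the authors' implementation: the fragment schedule, the 4-bit uniform quantizer, the plain-averaging outer step, and all constants (worker count, fragment count, inner-step count) are illustrative assumptions, and the sketch runs sequentially, so the overlap of communication with computation is only indicated in comments.

```python
import numpy as np

NUM_WORKERS = 4          # DiLoCo-style workers, each on its own data shard (illustrative)
NUM_FRAGMENTS = 3        # parameter subsets synchronized in sequence, not all at once
INNER_STEPS = 8          # local steps each worker runs between outer synchronizations
FRAGMENT_SIZE = 5        # toy fragment size

rng = np.random.default_rng(0)

# Global ("outer") parameters, split into fragments for streaming synchronization.
global_frags = [rng.normal(size=FRAGMENT_SIZE) for _ in range(NUM_FRAGMENTS)]
# Each worker keeps a local replica of every fragment.
local_frags = [[f.copy() for f in global_frags] for _ in range(NUM_WORKERS)]

def quantize(delta, num_bits=4):
    """Crude uniform quantization of an outer delta, standing in for the paper's
    low-precision exchange (the actual scheme may differ)."""
    scale = np.max(np.abs(delta)) + 1e-12
    levels = 2 ** (num_bits - 1) - 1
    return np.round(delta / scale * levels) / levels * scale

def local_training(worker):
    """Placeholder for one inner optimization step on the worker's data shard:
    here, just a small random drift of the worker's local parameters."""
    for frag in local_frags[worker]:
        frag += 0.01 * rng.normal(size=FRAGMENT_SIZE)

for outer_step in range(6):
    # Streaming: only ONE fragment is scheduled for synchronization this round,
    # so peak bandwidth is roughly 1/NUM_FRAGMENTS of a full exchange.
    frag_id = outer_step % NUM_FRAGMENTS

    # Inner phase: every worker keeps training locally. In a real system the
    # in-flight fragment's communication would overlap with these steps; this
    # sketch runs sequentially for clarity.
    for worker in range(NUM_WORKERS):
        for _ in range(INNER_STEPS):
            local_training(worker)

    # Each worker sends a quantized delta for the scheduled fragment only.
    deltas = [
        quantize(local_frags[w][frag_id] - global_frags[frag_id])
        for w in range(NUM_WORKERS)
    ]

    # Apply the averaged delta to the global fragment (plain averaging is used
    # here as a stand-in; the paper's actual outer update may differ).
    global_frags[frag_id] += np.mean(deltas, axis=0)

    # Broadcast the updated fragment back; the other fragments keep drifting
    # locally until their turn in the schedule.
    for worker in range(NUM_WORKERS):
        local_frags[worker][frag_id] = global_frags[frag_id].copy()

print("global fragment 0 after streaming sync:", np.round(global_frags[0], 3))
```

The point of this structure is that each outer round moves only one fragment's worth of quantized data between workers, which is where the reduction in peak bandwidth comes from, while the infrequent synchronization keeps the number of exchanges low.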