
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge

2025-11-24


Summary

This research details the first large-scale effort to train a powerful AI model, called ZAYA1, entirely on AMD hardware: MI300X GPUs connected by AMD's high-speed Pollara networking. It is a deep dive into what works well, and what still needs improvement, when building and training massive models on AMD systems.

What's the problem?

Training very large AI models requires enormous computing power and fast connections between the machines doing the work. Traditionally, this has been done on NVIDIA hardware. AMD has been developing competitive hardware, but there was no clear, public account of how best to use it for this kind of task at large scale. Someone needed to work out how to tune both the hardware setup and the model design to get good performance out of AMD's new technology.

What's the solution?

The researchers thoroughly characterized the AMD hardware and the Pollara network, running many microbenchmarks to measure how quickly data could be moved and processed. They used these results to derive guidelines for designing AI models that run efficiently on AMD's MI300X GPUs. They then built and trained ZAYA1, a mixture-of-experts model with 8.3 billion total parameters (of which only 760 million are active for any given input), and shared details of their training process, including how they handled hardware failures and saved checkpoints. They also compared ZAYA1's performance against existing models.
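The "8.3 billion total, 760 million active" distinction comes from the mixture-of-experts design: every token passes through the shared layers plus only a few routed experts, so most parameters sit idle on any given token. A minimal sketch of that arithmetic, with hypothetical expert counts and sizes chosen only to land near ZAYA1's headline numbers (the paper's actual layer dimensions are not given here):

```python
# Illustrative sketch: why a mixture-of-experts model can have ~8.3B total
# parameters while only ~760M are active per token. All specific numbers
# below are assumptions for illustration, not ZAYA1's real architecture.

def moe_param_counts(shared_params, n_experts, expert_params, top_k):
    """Total vs. per-token-active parameter counts for a simple MoE stack.

    shared_params: parameters every token uses (attention, embeddings, router)
    n_experts:     experts available (summed over all MoE layers)
    expert_params: parameters per expert (summed over all MoE layers)
    top_k:         experts actually routed to per token
    """
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active

# Hypothetical configuration that roughly reproduces 8.3B total / 760M active:
total, active = moe_param_counts(
    shared_params=260e6,   # assumed
    n_experts=16,          # assumed
    expert_params=500e6,   # assumed
    top_k=1,               # assumed
)
print(f"total: {total/1e9:.1f}B, active: {active/1e6:.0f}M")
# prints "total: 8.3B, active: 760M"
```

The key consequence: compute and inference latency scale with the active count, while model capacity scales with the total count, which is why the authors tune MoE widths for both training throughput and inference latency.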

Why it matters?

This work shows that AMD hardware and its associated technologies are now capable of competing with NVIDIA in the demanding field of large AI model training. This is important because it creates more competition in the AI hardware market, potentially leading to lower costs and faster innovation. It also provides a blueprint for others looking to build and train large models using AMD systems, and demonstrates that ZAYA1 performs well compared to other models, even those much larger in overall size.

Abstract

We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing MI300X GPUs with the Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
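Collective microbenchmarks like the ones the abstract describes are conventionally reported as "bus bandwidth" so that numbers stay comparable across GPU counts. A sketch of that standard reporting convention (as used by nccl-tests/rccl-tests; the paper's exact methodology is not specified here):

```python
# Sketch of the standard way collective-microbenchmark results are reported.
# For a ring all-reduce over n GPUs, each GPU sends/receives 2(n-1)/n times
# the message size, so "bus bandwidth" rescales raw algorithm bandwidth by
# that factor. This follows the nccl-tests convention, not necessarily the
# paper's own scripts.

def allreduce_bandwidths(message_bytes, elapsed_s, n_gpus):
    """Return (algorithm bandwidth, bus bandwidth) in bytes/s for an all-reduce."""
    algbw = message_bytes / elapsed_s              # raw: message size / time
    busbw = algbw * 2 * (n_gpus - 1) / n_gpus      # normalized across GPU counts
    return algbw, busbw

# Example with made-up numbers: a 1 GiB all-reduce across 8 GPUs in 25 ms.
algbw, busbw = allreduce_bandwidths(1 << 30, 25e-3, 8)
print(f"algbw: {algbw/1e9:.1f} GB/s, busbw: {busbw/1e9:.1f} GB/s")
# prints "algbw: 42.9 GB/s, busbw: 75.2 GB/s"
```

Sweeping `message_bytes` and `n_gpus` over a grid and plotting `busbw` is what produces the kind of characterization curves the abstract refers to.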