AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

Mark Lee, Tom Gunter, Chang Lan, John Peebles, Hanzhi Zhou, Kelvin Zou, Sneha Bangalore, Chung-Cheng Chiu, Nan Du, Xianzhi Du, Philipp Dufter, Ruixuan Hou, Haoshuo Huang, Dongseong Hwang, Xiang Kong, Jinhao Lei, Tao Lei, Meng Li, Li Li, Jiarui Lu, Zhiyun Lu, Yiping Ma

2025-07-09

Summary

This paper introduces AXLearn, a deep learning system designed to train very large AI models efficiently across different types of hardware. Its key idea is modularity: individual parts of the system can be added or swapped easily without disrupting the rest.

What's the problem?

Training big AI models requires enormous computing power spread across different kinds of accelerators, such as GPUs and TPUs. Most existing training systems are hard to customize and do not handle mixed hardware well, which makes training these models complex and slow.

What's the solution?

The researchers built AXLearn around strict modular design and well-defined software interfaces, so developers can plug in different components without touching the rest of the system. They also proposed a way to measure system complexity by counting the lines of code a change requires, and showed that AXLearn's complexity stays roughly constant as the system scales. This design supports advanced features with very little extra code while matching the performance of other state-of-the-art systems.
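The paper does not show AXLearn's actual API here, but the plug-in idea it describes can be sketched with a generic registry-plus-config pattern. The names below (`register`, `build`, `TrainerConfig`) are purely illustrative assumptions, not AXLearn's real interfaces:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical component registry; `register` and `build` are
# illustrative names, not AXLearn's real API.
_REGISTRY: Dict[str, Callable] = {}

def register(name: str):
    """Decorator that registers a component factory under a name."""
    def wrap(factory: Callable):
        _REGISTRY[name] = factory
        return factory
    return wrap

@dataclass
class TrainerConfig:
    """A tiny config object: changing one field swaps the component."""
    attention: str = "standard"
    options: dict = field(default_factory=dict)

@register("standard")
def standard_attention(**kwargs):
    # Stand-in for a real attention module.
    return f"standard-attention({kwargs})"

@register("flash")
def flash_attention(**kwargs):
    # Stand-in for an alternative, drop-in implementation.
    return f"flash-attention({kwargs})"

def build(cfg: TrainerConfig):
    """Look up the configured component; nothing else in the system changes."""
    return _REGISTRY[cfg.attention](**cfg.options)

# Swapping implementations is a one-line config change:
print(build(TrainerConfig(attention="flash", options={"block_size": 128})))
```

The point of the pattern is that adding a new component touches only its own registration, which is the kind of stable, near-constant code cost the paper measures.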

Why it matters?

This matters because AXLearn makes it easier and faster to develop and train cutting-edge AI models on a variety of hardware setups. This flexibility helps researchers innovate more quickly and reduces the cost and difficulty of training large AI models.

Abstract

AXLearn is a modular deep learning system designed for scalable, high-performance training on heterogeneous hardware, maintaining constant code complexity while delivering performance equivalent to state-of-the-art systems.