MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye
2026-04-08
Summary
This paper introduces MegaTrain, a new system for training extremely large language models (those with over 100 billion parameters) using just a single GPU.
What's the problem?
Training these massive language models is hard because they require enormous amounts of memory. Typically, everything needed for training (the model's parameters, their gradients, and the optimizer's bookkeeping state) has to fit in GPU memory, which is expensive and limited. Existing methods that offload to CPU memory instead are often slow, because moving data back and forth between the CPU and GPU becomes the bottleneck.
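To see why a single GPU is not enough, here is a rough back-of-envelope estimate (ours, not a figure from the paper) assuming full-precision Adam training: each parameter needs about 4 bytes for the fp32 weight, 4 for its gradient, and 8 for Adam's momentum and variance, roughly 16 bytes in total.

```python
# Illustrative memory estimate for full-precision Adam training.
# Per parameter: 4 B (fp32 weight) + 4 B (gradient) + 8 B (Adam moments) = 16 B.
def training_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Approximate optimizer-inclusive training memory in GB."""
    return num_params * bytes_per_param / 1e9

print(f"100B model: ~{training_memory_gb(100e9):,.0f} GB")  # ~1,600 GB
print("H200 HBM:   141 GB")  # a single GPU's memory, for comparison
```

Roughly 1.6 TB of state for a 100B model dwarfs any single GPU's memory, which is why it has to live somewhere else.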
What's the solution?
MegaTrain flips the script by keeping most of the model's state in CPU memory and treating the GPU as a fast, temporary workspace. To get around the slow CPU-GPU link, it continuously streams data to the GPU while the GPU is busy, overlapping three activities: fetching the next layer's weights, computing on the current layer, and sending finished gradients back. It also replaces the usual persistent record of how the model computes (the autograd graph) with lightweight, reusable layer descriptions, making training more efficient and flexible.
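The overlap idea can be sketched in plain Python. This is a conceptual illustration only (the names and structure are ours, not the paper's code): a prefetcher streams the next layer's weights in while the "GPU" computes on the current one, and finished gradients are shipped back on a third thread. The real system would use CUDA streams and pinned host memory rather than Python threads.

```python
import queue
import threading

def prefetch(layers, to_gpu: queue.Queue):
    """Simulate host-to-device copies: stream each layer's weights in order."""
    for weights in layers:
        to_gpu.put(weights)
    to_gpu.put(None)  # sentinel: no more layers

def offload(from_gpu: queue.Queue, sink: list):
    """Simulate device-to-host copies: collect gradients as they finish."""
    while (grad := from_gpu.get()) is not None:
        sink.append(grad)

def train_step(layers):
    to_gpu = queue.Queue(maxsize=2)    # double buffer: at most 2 layers in flight
    from_gpu = queue.Queue(maxsize=2)
    grads = []
    t_in = threading.Thread(target=prefetch, args=(layers, to_gpu))
    t_out = threading.Thread(target=offload, args=(from_gpu, grads))
    t_in.start(); t_out.start()
    # "Compute" overlaps with both copy directions via the bounded queues.
    while (w := to_gpu.get()) is not None:
        from_gpu.put(w * 2)            # stand-in for forward/backward compute
    from_gpu.put(None)
    t_in.join(); t_out.join()
    return grads

print(train_step([1, 2, 3]))  # -> [2, 4, 6]
```

The bounded queues play the role of the double buffer: the prefetcher can run at most two layers ahead, so transfers and compute proceed concurrently without unbounded staging memory.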
Why it matters?
This is important because it makes training huge language models more accessible. Instead of needing a cluster of expensive GPUs, you can potentially train these models on a single powerful GPU paired with a large amount of CPU memory. That opens the door for more researchers and developers to work with cutting-edge models, at training speeds faster than existing offloading methods.
Abstract
We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To combat the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.
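The "stateless layer template" idea from the abstract can be illustrated with a toy sketch (ours, hedged: the paper's actual templates operate on CUDA tensors inside the autograd replacement). The point is that a layer is a pure function of weights and activations with no owned state, so one template can be re-bound to whichever layer's weights just streamed in from host memory.

```python
def linear_template(weight, bias, x):
    """A stateless linear layer, y = W x + b, on plain Python lists.

    `weight` is a list of rows; the template owns no parameters itself.
    """
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weight, bias)]

# One template serves every layer; only the bound weights change as they
# stream in (hypothetical values, for illustration).
streamed_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # identity-like weights
    ([[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0]),  # scaled weights
]
x = [3.0, 4.0]
for weight, bias in streamed_layers:
    x = linear_template(weight, bias, x)     # bind weights as they arrive
print(x)  # -> [7.0, 9.0]
```

Because nothing in the template persists between calls, there is no per-layer graph metadata to keep resident on the GPU, and the scheduler is free to run layers in whatever order the stream delivers them.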