All is Not Lost: LLM Recovery without Checkpoints

Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen

2025-06-19

Summary

This paper introduces CheckFree and its extended version CheckFree+, methods for recovering large language model training after hardware failures without saving and reloading large checkpoints.

What's the problem?

Training large AI models is distributed across many machines, and when one machine fails, the usual remedy is to reload the model from a saved checkpoint. Saving and restoring these checkpoints is expensive in both time and storage, so failures can significantly slow down training and waste resources.

What's the solution?

The researchers created CheckFree, which reconstructs a failed pipeline stage by replacing its weights with an average of the neighboring stages' weights. CheckFree+ extends this to also handle the first and last stages, which have only one neighbor, using out-of-order pipeline execution. Both methods avoid the extra computation and storage of checkpointing, letting training continue with only a brief interruption.
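The core recovery idea can be illustrated with a small sketch. This is a hypothetical simplification, not the authors' implementation: stages are represented as plain dicts of parameter lists, and a failed interior stage is rebuilt by averaging its two neighbors parameter-wise.

```python
# Hypothetical sketch of neighbor-averaging recovery (not the paper's code).
# Each pipeline stage's weights are modeled as a dict: parameter name -> list of floats.

def recover_stage(stages, failed_idx):
    """Rebuild a failed stage by averaging the weights of its neighbors.

    For an interior stage, average the previous and next stages
    element-wise; a boundary stage would need special handling
    (the paper's CheckFree+ uses out-of-order pipeline execution).
    """
    neighbors = [
        stages[i]
        for i in (failed_idx - 1, failed_idx + 1)
        if 0 <= i < len(stages)
    ]
    recovered = {}
    for name in neighbors[0]:
        recovered[name] = [
            sum(vals) / len(vals)
            for vals in zip(*(n[name] for n in neighbors))
        ]
    return recovered

# Example: three stages, the middle one fails.
stages = [
    {"w": [1.0, 2.0]},   # stage 0
    None,                # stage 1 (failed)
    {"w": [3.0, 4.0]},   # stage 2
]
stages[1] = recover_stage(stages, 1)
print(stages[1])  # {'w': [2.0, 3.0]}
```

The recovered stage is only an approximation of the lost weights, but because neighboring transformer stages tend to be similar, continued training can converge from this starting point.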

Why it matters?

This matters because it makes training large AI models faster and cheaper by reducing the downtime caused by failures, enabling more people to train powerful AI systems without expensive checkpointing infrastructure or long delays.

Abstract

A novel method, CheckFree, and its extended version CheckFree+, efficiently recover from node failures during LLM training by substituting failed stages with averaged neighboring stages or through out-of-order pipeline execution, improving convergence time over existing checkpointing methods.