
All is Not Lost: LLM Recovery without Checkpoints (2506.15461v1)

Published 18 Jun 2025 in cs.DC and cs.LG

Abstract: Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization. The inevitable challenge here is the churn of nodes due to failures and the operator's scheduling policies, leading to losing a stage - a part of the model. The conventional approaches to recover from failures are to either use checkpointing, where periodically a copy of the entire model is sent to an additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose, CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighbouring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, behaviour of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbour. To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models of model sizes from 124M to 1.5B with varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree.

Authors (3)
  1. Nikolay Blagoev (3 papers)
  2. Oğuzhan Ersoy (5 papers)
  3. Lydia Yiyu Chen (2 papers)

Summary

This paper introduces "CheckFree" and "CheckFree+", two novel methods for recovering LLM training from stage failures in a distributed pipeline parallelism setup, particularly relevant for decentralized and "wimpy" computation nodes like spot instances. These methods aim to reduce the significant communication, storage, and computation overhead associated with traditional recovery techniques like checkpointing and redundant computation.

The core problem addressed is the loss of a training stage (a partition of the model's layers) due to node failures. Traditional checkpointing requires periodically saving the entire model, incurring high overhead, especially for large models. Redundant computation involves each node performing extra work for a subsequent stage, increasing computational load.

CheckFree: Memoryless Recovery for Intermediate Stages

CheckFree proposes reinitializing a failed intermediate stage by taking a weighted average of the weights of its two neighboring stages. The motivation stems from observations that:

  1. LLMs are resilient to layer omission, suggesting redundancy between layers.
  2. Layer stacking, where new layers are initialized from neighbors, can improve training.
  • Implementation:
    • When a stage $S_i$ fails, its new node receives the weights $W_{s,i-1}$ and $W_{s,i+1}$ from the neighboring stages $S_{i-1}$ and $S_{i+1}$, respectively.
    • It also receives the squared L2 norm of the last gradients from these neighbors: $\omega_{i-1} = \|\nabla W_{s,i-1}\|^2$ and $\omega_{i+1} = \|\nabla W_{s,i+1}\|^2$.
    • The weights of the failed stage $W_{s,i}$ are initialized as:

      $$W_{s,i} \leftarrow \frac{\omega_{i-1} W_{s,i-1} + \omega_{i+1} W_{s,i+1}}{\omega_{i-1} + \omega_{i+1}}$$

      This gives more importance to stages that are less converged (higher gradient norm), effectively offloading some of their learning to the new stage.

    • After reinitialization, the learning rate is scaled up by a factor of 1.1 to help the new stage adapt.

    • The storage and communication overhead for the gradient norms is negligible (a single scalar per stage).

The recovery process for a failed stage $i$ is outlined in Algorithm 1:

Algorithm 1: Recovery algorithm for stage i
1: REQUIRE: new node assigned to stage i, λ learning rate
2: Receive W_{s,i-1} and ω_{i-1} of stage i-1 where ω_{i-1} = ||∇W_{s,i-1}||²
3: Receive W_{s,i+1} and ω_{i+1} of stage i+1 where ω_{i+1} = ||∇W_{s,i+1}||²
4: Initialize the weights of the failed stage W_{s,i} ← (ω_{i-1}*W_{s,i-1} + ω_{i+1}*W_{s,i+1}) / (ω_{i-1} + ω_{i+1})
5: Update learning rate λ ← 1.1λ
6: Continue training from the current batch
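
The following Python sketch illustrates how Algorithm 1 could be implemented on the node taking over a failed intermediate stage. It assumes PyTorch stage modules with identical architectures (so parameters line up one-to-one) and that the neighbors' weights and gradient norms have already been received; the helper names are illustrative and not taken from the CheckFree codebase.

```python
# Minimal sketch of Algorithm 1: CheckFree recovery of intermediate stage i.
# Assumes all stages share the same transformer-block architecture; the
# function names below are hypothetical, not from the released code.
import torch


def grad_sq_norm(stage: torch.nn.Module) -> float:
    """Squared L2 norm of the stage's last gradients (omega in the paper)."""
    return sum(p.grad.pow(2).sum().item()
               for p in stage.parameters() if p.grad is not None)


@torch.no_grad()
def recover_stage(new_stage: torch.nn.Module,
                  prev_stage: torch.nn.Module,
                  next_stage: torch.nn.Module,
                  lr: float) -> float:
    omega_prev = grad_sq_norm(prev_stage)   # omega_{i-1}
    omega_next = grad_sq_norm(next_stage)   # omega_{i+1}
    total = omega_prev + omega_next
    for p_new, p_prev, p_next in zip(new_stage.parameters(),
                                     prev_stage.parameters(),
                                     next_stage.parameters()):
        # W_{s,i} <- (omega_{i-1} W_{s,i-1} + omega_{i+1} W_{s,i+1}) / (omega_{i-1} + omega_{i+1})
        p_new.copy_((omega_prev * p_prev + omega_next * p_next) / total)
    # Scale the learning rate up by 1.1 so the reinitialized stage adapts faster.
    return 1.1 * lr
```

In the actual decentralized setting, the neighbors' weights and the scalar gradient norms would be sent over the network to the replacement node rather than read from local modules.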

CheckFree, however, cannot recover the first and last stages of the pipeline as they only have one neighbor, and simply copying from that single neighbor leads to a significant performance drop.

CheckFree+: First and Last Stage Recovery

CheckFree+ extends CheckFree to handle failures of the first and last stages.

  • Implementation for First/Last Transformer Stages:
    • It utilizes out-of-order pipeline execution. For half the microbatches, the standard pipeline order ($S_0, S_1, S_2, \dots, S_{L-1}, S_L, S_0$) is used.
    • For the other half, the order of the first two and last two transformer stages is swapped (i.e., $S_0, S_2, S_1, \dots, S_L, S_{L-1}, S_0$). $S_0$ typically holds the embedding/de-embedding layers.
    • This makes stage $S_2$ learn the behavior of $S_1$, and $S_{L-1}$ learn the behavior of $S_L$, without additional computation, since they effectively take their neighbors' places in the pipeline for half the microbatches (see the sketch after this list).
    • If $S_1$ (or $S_L$) fails, it can be recovered by simply copying the weights from $S_2$ (or $S_{L-1}$).
  • Implementation for Embedding/De-embedding Layers:
    • The (de)embedding layers (typically in $S_0$) are critical. CheckFree+ handles their recovery by replicating these layers on the neighboring stages ($S_1$ and $S_L$).
    • This requires a small storage overhead of $O(|\mathcal{E}|)$, where $|\mathcal{E}|$ is the size of the (de)embedding layers, which is significantly smaller than the full model size. If $S_0$ fails, its weights can be recovered exactly from these copies.
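
Below is a minimal sketch of the out-of-order schedule and the copy-based recovery described above, assuming a pipeline with stages $S_0, \dots, S_L$ where $S_0$ holds the (de)embedding layers; the function names are illustrative rather than taken from the released code.

```python
# Sketch of CheckFree+'s alternating (out-of-order) microbatch schedule and
# copy-based recovery of the first/last transformer stages. Helper names are
# hypothetical; only the stage orderings follow the paper's description.
import copy
import torch


def stage_order(microbatch_idx: int, L: int) -> list[int]:
    """Stage visit order for one microbatch; S0 starts and ends the pipeline."""
    middle = list(range(1, L + 1))                  # transformer stages S1 .. SL
    if microbatch_idx % 2 == 1:                     # swapped order for half the microbatches
        middle[0], middle[1] = middle[1], middle[0]          # S2 before S1
        middle[-1], middle[-2] = middle[-2], middle[-1]      # SL-1 after SL
    return [0] + middle + [0]


def recover_first_or_last(failed: int, stages: dict[int, torch.nn.Module], L: int) -> None:
    """Recover S1 (or SL) by copying the weights of the stage that mimicked it."""
    donor = 2 if failed == 1 else L - 1
    stages[failed] = copy.deepcopy(stages[donor])


# Example with L = 4:
#   stage_order(0, 4) -> [0, 1, 2, 3, 4, 0]   (standard order)
#   stage_order(1, 4) -> [0, 2, 1, 4, 3, 0]   (first two and last two swapped)
```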

Comparison of Recovery Strategies:

| Feature | Checkpointing | Redundant Comp. | CheckFree | CheckFree+ |
| --- | --- | --- | --- | --- |
| Additional memory | $O(\vert\mathcal{F}\vert)$ | $O(\vert\mathcal{F}\vert)$ | 0 | $O(\vert\mathcal{E}\vert)$ |
| Additional communication | $O(\vert\mathcal{F}\vert)$ | $O(\vert\mathcal{F}\vert)$ | 0 | $O(\vert\mathcal{E}\vert)$ |
| Additional computation | 0 | Forward pass | 0 | 0 |
| Non-faulty storage | Yes | No | No | No |
| Recoverable stages | All stages | Non-consecutive stages | Non-consecutive intermediate stages | Non-consecutive stages |

Here $\vert\mathcal{F}\vert$ denotes the size of the full model and $\vert\mathcal{E}\vert$ the size of the (de)embedding layers.

Evaluation:

The methods were evaluated on LLaMa models (124M, 500M, 1.5B parameters) with varying stage failure rates (5%, 10%, 16% per hour).

  • Baselines: Checkpointing (a checkpoint taken roughly every 3 hours) and Redundant Computation (each stage also computes the subsequent stage's forward pass).
  • Performance Metrics: Iteration time and total training time to reach a target validation loss.
  • Results:
    • CheckFree and CheckFree+ significantly outperformed checkpointing in terms of wall-clock training time due to avoiding rollback delays.
    • At low to medium failure rates (5-10%), CheckFree and CheckFree+ were over 12% faster in wall-clock time to convergence than redundant computation. This is because while redundant computation might converge in fewer iterations, its iteration time is much higher.
    • CheckFree+ demonstrated robustness across different failure frequencies, with only slight degradation in validation loss even when failure rates tripled.
    • The out-of-order swapping in CheckFree+ incurs a convergence slowdown in a no-failure setting, but is beneficial when failures are present.
    • Large models trained with CheckFree+ (at 16% failure rate) achieved perplexity comparable to models trained without faults, despite different resulting weights.

Practical Implications and Limitations:

  • CheckFree and CheckFree+ offer a lightweight recovery mechanism without needing external storage or redundant computations, making them suitable for cost-sensitive decentralized training.
  • The recovery time for a failed stage is around 30 seconds.
  • A key limitation is that neither method can recover from consecutive stage failures.
  • The out-of-order execution in CheckFree+ can slightly slow down convergence in scenarios with very few or no failures.

Conclusion:

CheckFree and CheckFree+ present efficient alternatives for fault tolerance in distributed LLM training. By leveraging LLM properties like layer redundancy and resilience to omission, they achieve faster training times under failure conditions compared to existing methods, with minimal overhead. The code is available at https://github.com/gensyn-ai/CheckFree. Future work includes addressing consecutive failures and improving convergence in non-faulty cases for CheckFree+.