
All is Not Lost: LLM Recovery without Checkpoints (2506.15461v1)

Published 18 Jun 2025 in cs.DC and cs.LG

Abstract: Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization. The inevitable challenge here is the churn of nodes due to failures and the operator's scheduling policies, leading to losing a stage - a part of the model. The conventional approaches to recover from failures are to either use checkpointing, where periodically a copy of the entire model is sent to an additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose, CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighbouring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, behaviour of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbour. To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models of model sizes from 124M to 1.5B with varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree.

Authors (3)
  1. Nikolay Blagoev (3 papers)
  2. Oğuzhan Ersoy (5 papers)
  3. Lydia Yiyu Chen (2 papers)

Summary

This paper introduces "CheckFree" and "CheckFree+", two novel methods for recovering LLM training from stage failures in a distributed pipeline parallelism setup, particularly relevant for decentralized and "wimpy" computation nodes like spot instances. These methods aim to reduce the significant communication, storage, and computation overhead associated with traditional recovery techniques like checkpointing and redundant computation.

The core problem addressed is the loss of a training stage (a partition of the model's layers) due to node failures. Traditional checkpointing requires periodically saving the entire model, incurring high overhead, especially for large models. Redundant computation involves each node performing extra work for a subsequent stage, increasing computational load.

CheckFree: Memoryless Recovery for Intermediate Stages

CheckFree proposes reinitializing a failed intermediate stage by taking a weighted average of the weights of its two neighboring stages. The motivation stems from observations that:

  1. LLMs are resilient to layer omission, suggesting redundancy between layers.
  2. Layer stacking, where new layers are initialized from neighbors, can improve training.
  • Implementation:
    • When a stage $S_i$ fails, its new node receives the weights $W_{s,i-1}$ and $W_{s,i+1}$ from the neighboring stages $S_{i-1}$ and $S_{i+1}$, respectively.
    • It also receives the squared L2 norm of the last gradients from these neighbors: $\omega_{i-1} = \|\nabla W_{s,i-1}\|^2$ and $\omega_{i+1} = \|\nabla W_{s,i+1}\|^2$.
    • The weights of the failed stage $W_{s,i}$ are initialized as:

      $$W_{s,i} \leftarrow \frac{\omega_{i-1} W_{s,i-1} + \omega_{i+1} W_{s,i+1}}{\omega_{i-1} + \omega_{i+1}}$$

      This gives more importance to stages that are less converged (higher gradient norm), effectively offloading some of their learning to the new stage.

    • After reinitialization, the learning rate is scaled up by a factor of 1.1 to help the new stage adapt.

    • The storage and communication overhead for the gradient norms is negligible (a single scalar per stage).

The recovery process for a failed stage $i$ is outlined in Algorithm 1:

Algorithm 1: Recovery algorithm for stage i
1: REQUIRE: new node assigned to stage i, λ learning rate
2: Receive W_{s,i-1} and ω_{i-1} of stage i-1 where ω_{i-1} = ||∇W_{s,i-1}||²
3: Receive W_{s,i+1} and ω_{i+1} of stage i+1 where ω_{i+1} = ||∇W_{s,i+1}||²
4: Initialize the weights of the failed stage W_{s,i} ← (ω_{i-1}*W_{s,i-1} + ω_{i+1}*W_{s,i+1}) / (ω_{i-1} + ω_{i+1})
5: Update learning rate λ ← 1.1λ
6: Continue training from the current batch
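
The following Python sketch illustrates how Algorithm 1 could be implemented on the node taking over a failed intermediate stage. It assumes PyTorch stage modules with identical architectures (so parameters line up one-to-one) and that the neighbors' weights and gradient norms have already been received; the helper names are illustrative and not taken from the CheckFree codebase.

```python
# Minimal sketch of Algorithm 1: CheckFree recovery of intermediate stage i.
# Assumes all stages share the same transformer-block architecture; the
# function names below are hypothetical, not from the released code.
import torch


def grad_sq_norm(stage: torch.nn.Module) -> float:
    """Squared L2 norm of the stage's last gradients (omega in the paper)."""
    return sum(p.grad.pow(2).sum().item()
               for p in stage.parameters() if p.grad is not None)


@torch.no_grad()
def recover_stage(new_stage: torch.nn.Module,
                  prev_stage: torch.nn.Module,
                  next_stage: torch.nn.Module,
                  lr: float) -> float:
    omega_prev = grad_sq_norm(prev_stage)   # omega_{i-1}
    omega_next = grad_sq_norm(next_stage)   # omega_{i+1}
    total = omega_prev + omega_next
    for p_new, p_prev, p_next in zip(new_stage.parameters(),
                                     prev_stage.parameters(),
                                     next_stage.parameters()):
        # W_{s,i} <- (omega_{i-1} W_{s,i-1} + omega_{i+1} W_{s,i+1}) / (omega_{i-1} + omega_{i+1})
        p_new.copy_((omega_prev * p_prev + omega_next * p_next) / total)
    # Scale the learning rate up by 1.1 so the reinitialized stage adapts faster.
    return 1.1 * lr
```

In the actual decentralized setting, the neighbors' weights and the scalar gradient norms would be sent over the network to the replacement node rather than read from local modules.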

CheckFree, however, cannot recover the first and last stages of the pipeline as they only have one neighbor, and simply copying from that single neighbor leads to a significant performance drop.

CheckFree+: First and Last Stage Recovery

CheckFree+ extends CheckFree to handle failures of the first and last stages.

  • Implementation for First/Last Transformer Stages:
    • It utilizes out-of-order pipeline execution. For half the microbatches, the standard pipeline order ($S_0, S_1, S_2, \dots, S_{L-1}, S_L, S_0$) is used.
    • For the other half, the order of the first two and last two transformer stages is swapped (i.e., $S_0, S_2, S_1, \dots, S_L, S_{L-1}, S_0$). $S_0$ typically holds the embedding/de-embedding layers.
    • This makes stage $S_2$ learn the behavior of $S_1$, and $S_{L-1}$ learn the behavior of $S_L$, without additional computation, since they effectively take their neighbors' places in the pipeline for half the microbatches (see the sketch after this list).
    • If $S_1$ (or $S_L$) fails, it can be recovered by simply copying the weights from $S_2$ (or $S_{L-1}$).
  • Implementation for Embedding/De-embedding Layers:
    • The (de)embedding layers (typically in $S_0$) are critical. CheckFree+ handles their recovery by replicating these layers on the neighboring stages ($S_1$ and $S_L$).
    • This requires a small storage overhead of $O(|\mathcal{E}|)$, where $|\mathcal{E}|$ is the size of the (de)embedding layers, which is significantly smaller than the full model size. If $S_0$ fails, its weights can be recovered exactly from these copies.
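
Below is a minimal sketch of the out-of-order schedule and the copy-based recovery described above, assuming a pipeline with stages $S_0, \dots, S_L$ where $S_0$ holds the (de)embedding layers; the function names are illustrative rather than taken from the released code.

```python
# Sketch of CheckFree+'s alternating (out-of-order) microbatch schedule and
# copy-based recovery of the first/last transformer stages. Helper names are
# hypothetical; only the stage orderings follow the paper's description.
import copy
import torch


def stage_order(microbatch_idx: int, L: int) -> list[int]:
    """Stage visit order for one microbatch; S0 starts and ends the pipeline."""
    middle = list(range(1, L + 1))                  # transformer stages S1 .. SL
    if microbatch_idx % 2 == 1:                     # swapped order for half the microbatches
        middle[0], middle[1] = middle[1], middle[0]          # S2 before S1
        middle[-1], middle[-2] = middle[-2], middle[-1]      # SL-1 after SL
    return [0] + middle + [0]


def recover_first_or_last(failed: int, stages: dict[int, torch.nn.Module], L: int) -> None:
    """Recover S1 (or SL) by copying the weights of the stage that mimicked it."""
    donor = 2 if failed == 1 else L - 1
    stages[failed] = copy.deepcopy(stages[donor])


# Example with L = 4:
#   stage_order(0, 4) -> [0, 1, 2, 3, 4, 0]   (standard order)
#   stage_order(1, 4) -> [0, 2, 1, 4, 3, 0]   (first two and last two swapped)
```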

Comparison of Recovery Strategies:

| Feature | Checkpointing | Redundant Comp. | CheckFree | CheckFree+ |
| --- | --- | --- | --- | --- |
| Additional memory | $O(\vert\mathcal{F}\vert)$ | $O(\vert\mathcal{F}\vert)$ | 0 | $O(\vert\mathcal{E}\vert)$ |
| Additional communication | $O(\vert\mathcal{F}\vert)$ | $O(\vert\mathcal{F}\vert)$ | 0 | $O(\vert\mathcal{E}\vert)$ |
| Additional computation | 0 | Forward pass | 0 | 0 |
| Non-faulty storage | Yes | No | No | No |
| Recoverable stages | All stages | Non-consecutive stages | Non-consecutive intermediate stages | Non-consecutive stages |

Here $\vert\mathcal{F}\vert$ denotes the size of the full model and $\vert\mathcal{E}\vert$ the size of the (de)embedding layers.

Evaluation:

The methods were evaluated on LLaMa models (124M, 500M, 1.5B parameters) with varying stage failure rates (5%, 10%, 16% per hour).

  • Baselines: Checkpointing (a checkpoint taken roughly every 3 hours) and Redundant Computation (each stage also computes the subsequent stage's forward pass).
  • Performance Metrics: Iteration time and total training time to reach a target validation loss.
  • Results:
    • CheckFree and CheckFree+ significantly outperformed checkpointing in terms of wall-clock training time due to avoiding rollback delays.
    • At low to medium failure rates (5-10%), CheckFree and CheckFree+ were over 12% faster in wall-clock time to convergence than redundant computation. This is because while redundant computation might converge in fewer iterations, its iteration time is much higher.
    • CheckFree+ demonstrated robustness across different failure frequencies, with only slight degradation in validation loss even when failure rates tripled.
    • The out-of-order swapping in CheckFree+ incurs a convergence slowdown in a no-failure setting, but is beneficial when failures are present.
    • Large models trained with CheckFree+ (at 16% failure rate) achieved perplexity comparable to models trained without faults, despite different resulting weights.

Practical Implications and Limitations:

  • CheckFree and CheckFree+ offer a lightweight recovery mechanism without needing external storage or redundant computations, making them suitable for cost-sensitive decentralized training.
  • The recovery time for a failed stage is around 30 seconds.
  • A key limitation is that neither method can recover from consecutive stage failures.
  • The out-of-order execution in CheckFree+ can slightly slow down convergence in scenarios with very few or no failures.

Conclusion:

CheckFree and CheckFree+ present efficient alternatives for fault tolerance in distributed LLM training. By leveraging LLM properties like layer redundancy and resilience to omission, they achieve faster training times under failure conditions compared to existing methods, with minimal overhead. The code is available at https://github.com/gensyn-ai/CheckFree. Future work includes addressing consecutive failures and improving convergence in non-faulty cases for CheckFree+.