Latest Weight Averaging (LAWA)

Updated 11 December 2025
  • Latest Weight Averaging (LAWA) is a family of algorithms that computes the mean of the most recent model weights to speed up training and improve generalization.
  • LAWA encompasses diverse strategies such as sliding-window, multi-model, and hierarchical averaging, each tailored to address specific training dynamics and robustness challenges.
  • Empirical results demonstrate that LAWA reduces training time, boosts validation accuracy, and enhances robustness across various tasks including vision and language modeling.

Latest Weight Averaging (LAWA) encompasses a family of algorithms and strategies that form averaged model weights from the most recent parameter iterates, checkpoint trajectories, or independently trained submodels, with the aims of accelerating training, improving generalization, enabling efficient model merging, and facilitating robust adaptation under task or distribution shift. LAWA contrasts with traditional stochastic weight averaging (SWA), which averages checkpoints collected over much longer training horizons or in the late phases of training. The paradigm now spans simple trajectory averaging, multi-model and multi-domain merging, distributed ensemble training, and specialized procedures for adversarial robustness.

1. Definition and Varieties of Latest Weight Averaging

LAWA refers to weight averaging schemes that compute an averaged model from the k most recent checkpoints or parameter vectors, leveraging either the trajectory of a single training run or a population of related models. The general formula, for parameter vectors w_1, …, w_E at epochs or steps 1, …, E, is

w^{\text{LAWA}}_E = \frac{1}{k} \sum_{i=E-k+1}^{E} w_i.

More generally, non-uniform weights can be used, and LAWA may operate over sliding windows, offline checkpoint collections, or buffers updated every ν steps (Kaddour, 2022, Ajroldi et al., 10 Feb 2025).
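
As a minimal illustration (the weights γ_i below are generic placeholders, not a scheme prescribed by the cited papers), a weighted window average takes the form

w^{\text{LAWA}}_E = \sum_{i=E-k+1}^{E} \gamma_i\, w_i, \qquad \gamma_i \ge 0, \quad \sum_{i=E-k+1}^{E} \gamma_i = 1,

where uniform weights γ_i = 1/k recover the formula above, and exponentially decaying weights yield an EMA-style average restricted to the window.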

Variants include:

  • Sliding-window and trajectory averaging over the last k checkpoints of a single run (Kaddour, 2022, Ajroldi et al., 10 Feb 2025).
  • Multi-model and multi-domain merging, including adapter soups and centered low-rank merging (Chronopoulou et al., 2023, Choi et al., 11 Dec 2024).
  • Distributed and hierarchical averaging across parallel workers (Gu et al., 2023, Fournier et al., 27 May 2024).
  • Selective averaging for adversarial robustness (Jia et al., 2023).

LAWA's central principle is to exploit the statistical proximity of the latest checkpoints or related models to construct a solution that achieves both rapid convergence and favorable generalization.

2. Theoretical Foundations and Convergence Guarantees

The theoretical basis for LAWA is captured by recent finite weight averaging (FWA) analyses. Under standard convexity, smoothness, and bounded variance assumptions, averaging the last k iterates achieves the convergence rate

\mathbb{E}[F(\bar w^k_T)] - F(w^*) = \mathcal{O}\left( \frac{\log(T/(2k))}{\sqrt{T}} \right),

where T is the total number of SGD steps (Wang et al., 20 Nov 2024). This improves the classical SGD bound by replacing the log T factor with the strictly smaller log(T/k).

For generalization, FWA (including LAWA) achieves tighter stability and generalization bounds than last-iterate SGD, especially for constant learning rates. For a fixed step size α and convex losses,

\epsilon_{\text{gen}} \leq \frac{2\alpha L^2}{n}\,(T - k/2),

where n is the dataset size. For non-convex regimes, recursive stability analyses confirm that FWA reduces instability more effectively with larger k under small constant learning rates, while decaying schedules may partially offset this benefit (Wang et al., 20 Nov 2024).

These results explain why LAWA often yields both faster convergence and improved generalization, as empirically validated on linear regression, vision tasks (CIFAR-10/100), and large-scale LLMs (Kaddour, 2022, Sanyal et al., 2023, Ajroldi et al., 10 Feb 2025).

3. Methodological Instantiations

3.1 Sliding Window and Trajectory Averaging

In practice, LAWA is most simply realized by maintaining a rolling buffer of the last k checkpoints and returning their mean (Kaddour, 2022, Ajroldi et al., 10 Feb 2025):

import copy
import torch

k = 6                                 # window size (typically 4-10 checkpoints)
ckpts = []                            # rolling buffer of the k most recent checkpoints
lawa_model = copy.deepcopy(model)     # separate model holding the averaged weights

for epoch in range(num_epochs):
    ...                               # train `model` for one epoch (or nu steps)
    ckpts.append({n: p.detach().clone() for n, p in model.state_dict().items()})
    if len(ckpts) > k:
        ckpts.pop(0)                  # discard the oldest checkpoint
    lawa_model.load_state_dict(       # load the element-wise mean of the buffer
        {n: torch.stack([c[n] for c in ckpts]).float().mean(0) for n in ckpts[0]})

Empirical recommendations are k = 4–10; larger k gives diminishing returns and can introduce averaging over divergent regions (Kaddour, 2022, Sanyal et al., 2023, Ajroldi et al., 10 Feb 2025).

3.2 Multi-Model and Domain Averaging

For merging T independently fine-tuned models, LAWA uses the uniform or weighted mean

\theta_{\text{avg}} = \frac{1}{T} \sum_{t=1}^{T} \theta_t.

Recent advances apply LAWA to task adapters (AdapterSoup), where the parameters {φ_i : i ∈ S} of a selected subset S of k domain adapters are averaged:

\bar{\varphi} = \frac{1}{k} \sum_{i \in S} \varphi_i,

with selection based on sentence similarity (Sentence-BERT) or GMM clustering of CLS embeddings (Chronopoulou et al., 2023).
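
A minimal sketch of the subset-averaging step, assuming adapters are stored as state dicts and that a separate (hypothetical) select_adapters scorer has already picked the subset; this is not the AdapterSoup reference implementation:

def average_adapters(adapters, selected_indices):
    # adapters: list of state_dicts (parameter name -> tensor), one per domain adapter
    # selected_indices: indices of the k adapters chosen by similarity or clustering
    chosen = [adapters[i] for i in selected_indices]
    k = len(chosen)
    return {name: sum(sd[name] for sd in chosen) / k for name in chosen[0]}

# hypothetical usage: score adapters against the new domain, then merge the top-k
# selected = select_adapters(new_domain_texts, adapter_metadata, k=3)
# model.load_adapter(average_adapters(adapters, selected))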

Novel "centered" LAWA frameworks use the SVD of centered task vectors Δt=θtθavg\Delta_t = \theta_t - \theta_{\text{avg}}, projecting onto the leading kk singular directions, which suppresses inter-task interference and enhances multi-task performance (Choi et al., 11 Dec 2024).

3.3 Distributed and Hierarchical Weight Averaging

WASH (Weight Averaging using parameter SHuffling) aligns populations of neural nets in the same loss basin via low-frequency, random weight exchanges during distributed training. At each step, a fraction p of coordinates is shuffled across the N models, keeping models alignable and post-hoc averaging effective (Fournier et al., 27 May 2024). The shuffling operator preserves the squared distance to the mean, maintaining ensemble diversity.
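
A minimal sketch of the per-parameter shuffling operator, assuming all N replicas live in one process; communication, scheduling, and optimizer details of WASH are omitted:

import torch

def shuffle_coordinates(params, p):
    # params: the same parameter tensor taken from each of the N replicas
    stacked = torch.stack([q.data.flatten() for q in params])       # shape (N, d)
    N, d = stacked.shape
    mask = torch.rand(d) < p                                        # pick a random fraction p of coordinates
    m = int(mask.sum())
    perm = torch.argsort(torch.rand(N, m), dim=0)                   # independent permutation per coordinate
    stacked[:, mask] = torch.gather(stacked[:, mask], 0, perm)      # exchange values across replicas
    for q, row in zip(params, stacked):                             # write shuffled values back
        q.data.copy_(row.view_as(q.data))

Because each coordinate's values are only permuted across replicas, the per-coordinate mean, and hence the averaged model, is unchanged.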

Hierarchical Weight Averaging (HWA) unifies online (model-parallel, synchronized averaging) and offline (sliding-window) WA. At each synchronization cycle (every H steps), the K local models are averaged, and then the last I cycle-averages are themselves averaged post-training (Gu et al., 2023).
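
A schematic of the two-level procedure under the notation above (H synchronization period, K local models, I retained cycle-averages); train_one_step and the state-dict averaging helper are placeholders, not the reference implementation:

import torch

def average_state_dicts(state_dicts):
    # uniform element-wise mean of a list of state dicts
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(0)
            for k in state_dicts[0]}

def hierarchical_weight_averaging(local_models, total_steps, H, I, train_one_step):
    cycle_averages = []
    for step in range(1, total_steps + 1):
        for m in local_models:
            train_one_step(m)                   # each of the K workers takes one local step
        if step % H == 0:                       # online stage: synchronize every H steps
            avg = average_state_dicts([m.state_dict() for m in local_models])
            for m in local_models:
                m.load_state_dict(avg)
            cycle_averages.append(avg)
    return average_state_dicts(cycle_averages[-I:])   # offline stage: average the last I cycles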

4. Empirical Performance and Practical Guidelines

Across architectures (ResNet, ViT, U-Net, LLMs), LAWA and its variants yield:

  • Training acceleration: LAWA advances loss/accuracy curves by tens of epochs on ImageNet/BERT, reducing compute by 20–30%, and systematically reaches validation targets in 15–25% fewer steps (Kaddour, 2022, Ajroldi et al., 10 Feb 2025).
  • Generalization improvement: Consistent but mild gains (0.5–2%) in held-out accuracy or BLEU/WER on tasks from vision to speech and translation (Ajroldi et al., 10 Feb 2025).
  • Robustness under distribution shift and model merging: Weight-averaged "soups" (e.g., AdapterSoup) improve zero-shot perplexity for novel domains without retraining, with best out-domain results from careful adapter selection (clustering-based) (Chronopoulou et al., 2023). Centered, low-rank LAWA (CART) closes most of the gap to traditional multitask learning (Choi et al., 11 Dec 2024).
  • Adversarial robustness: In fast adversarial training, auto weight averaging (A-WA) variants improve robust accuracy by discarding poor iterates based on on-the-fly attack quality (Jia et al., 2023).
  • Distributed training: WASH achieves communication cost reductions of up to 200×, delivering averaged-model accuracy nearly matching full ensembles on CIFAR/ImageNet (Fournier et al., 27 May 2024).

Key best-practices:

  • Begin LAWA after warmup or 10–15% into training; overly early averaging may hurt (Sanyal et al., 2023).
  • Use a window size k (or averaging horizon L) of around 1% of total steps for LAWA; tune the buffer length for optimal speedup, as in the sketch after this list (Ajroldi et al., 10 Feb 2025).
  • LAWA is optimizer-agnostic and robust to base LR settings; it is compatible with Adam, NadamW, Shampoo, and second-order methods (Ajroldi et al., 10 Feb 2025).
  • In multi-model soup constructions, model diversity (via hyperparameter sweep or gradient-similarity penalty) substantially enhances OOD generalization (Xu, 14 Jan 2025).
  • Combine LAWA with learning-rate annealing for maximal effect; pure averaging is not a substitute for LR schedules (Ajroldi et al., 10 Feb 2025).
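
A compact sketch that turns these rules of thumb into concrete numbers; the values below are illustrative defaults derived from the guidelines above, not settings prescribed by any single paper:

total_steps = 200_000                          # overall training budget
lawa_start = int(0.10 * total_steps)           # begin averaging ~10-15% into training
window_steps = int(0.01 * total_steps)         # averaging window of roughly 1% of total steps
ckpt_every = 200                               # buffer update frequency (nu steps)
k = max(1, window_steps // ckpt_every)         # number of buffered checkpoints, here 10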

5. Extensions and Specialized LAWA Algorithms

AdapterSoup and Domain-Averaged Adapters

AdapterSoup (Adapter-based LAWA) forms a single averaged domain adapter for pretrained LMs via arithmetic mean over a selected subset. Adapter selection exploits semantic similarity and clustering to avoid negative interference seen in uniform soups. The method is embarrassingly parallel, requires no retraining for new domains, and achieves test-time efficiency equal to single-adapter inference, provided sufficient memory to store adapter sets. High diversity among adapters (e.g., via higher LR) improves OOD generalization (Chronopoulou et al., 2023).

Model Merging and Centered Averaging

CART (Centered Averaging with Low-Rank Truncation) reframes merging as SVD on centered task directions, truncating the result to retain only high-variance shared components. Empirically, small-rank projections yield peak multitask accuracy, outperforming naive merging and retaining stability as the number T of merged tasks grows (Choi et al., 11 Dec 2024).

Robust and Selective Averaging

Auto Weight Averaging (A-WA) for adversarially robust training incorporates only those parameter updates passing an adversarial success threshold. This discriminative step protects the aggregation from catastrophic overfitting that disables ordinary EMA/SWA in single-step AT (Jia et al., 2023).
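
A minimal sketch of the selective-inclusion step; the robust-accuracy probe, the threshold, and the running-mean update are illustrative assumptions rather than the published A-WA criterion:

def maybe_include(avg_params, new_params, n_included, robust_acc, threshold=0.2):
    # fold new_params into the running average only if its adversarial (robust)
    # accuracy on a small probe batch clears the threshold; otherwise discard it
    if robust_acc < threshold:
        return avg_params, n_included
    n = n_included + 1
    updated = {name: avg_params[name] + (new_params[name] - avg_params[name]) / n
               for name in avg_params}
    return updated, n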

Distributed Shuffling

WASH advances LAWA by shuffling a small, randomly selected set of weights after each SGD step in each model. This both stochastically aligns and diversifies models, making post-hoc parameter averaging effective for single-model inference at ensemble-level performance, while drastically reducing synchronization cost relative to full-model all-reduce or EASGD-style schemes. Empirical evidence shows linear interpolations among WASH-aligned models remain in low-loss basins (Fournier et al., 27 May 2024).

6. Interaction with Learning Rate Schedules and Complementary Schemes

Averaging strategies, including LAWA and EMA, act as variance reduction mechanisms that mimic the effect of learning-rate annealing but do not fully replace it. Averaging provides a smoothing ("cooling") effect on parameter trajectories, often enabling larger LR schedules and faster exploration of flat minima. However, best performance arises from combining LAWA with standard LR decay; pure offline averaging without annealing cannot match the benefits of the combined approach (Ajroldi et al., 10 Feb 2025, Sanyal et al., 2023). Early weight averaging can serve as a surrogate for late-stage decay but must be initiated after achieving mode connectivity for best results (Sanyal et al., 2023).

Complementarity with sharpness-aware minimization (SAM) and gradient diversity regularization further extends LAWA's benefit for OOD generalization and few-shot adaptation (Xu, 14 Jan 2025).

7. Current Limitations and Open Directions

Despite robust empirical and theoretical support, LAWA exhibits several limitations:

  • Effectiveness can degrade if the averaging window encompasses diverging regions in parameter space or if batch-norm statistics are improperly synchronized (Sanyal et al., 2023).
  • For model merging across widely differing tasks, naive averaging can induce destructive interference unless centered, low-rank projections are used (Choi et al., 11 Dec 2024).
  • Memory cost scales with the number of checkpoints/adapters; efficient storage management is required for large k or T (Kaddour, 2022, Ajroldi et al., 10 Feb 2025).
  • The optimal choice of kk, checkpoint frequency, and initiation time remains problem-dependent, with hyperparameter sweeps required in practice.
  • Not all architectures (e.g., architectures heavily dependent on batch-norm or very deep residual networks) respond equally to early or sliding-window LAWA.

Open questions include adaptive schemes for weighting checkpoints, extension of LAWA-based merging to more diverse model classes, theoretical analyses for nonconvex and non-i.i.d. distributions, and automatic detection of optimal averaging intervals and subsets.


Key Sources: (Kaddour, 2022, Chronopoulou et al., 2023, Gu et al., 2023, Sanyal et al., 2023, Jia et al., 2023, Fournier et al., 27 May 2024, Wang et al., 20 Nov 2024, Choi et al., 11 Dec 2024, Xu, 14 Jan 2025, Ajroldi et al., 10 Feb 2025)
