Does model souping require shared pretraining checkpoints?
Determine whether model souping via weight averaging for large language models requires that the combined checkpoints originate from the same pretraining checkpoint, or whether weight averaging remains viable when the checkpoints come from different pretraining runs.
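Below is a minimal sketch of the souping operation in question: uniform weight averaging over several fine-tuned checkpoints. It assumes PyTorch checkpoints whose state_dicts have identical keys and tensor shapes, which in practice typically means they descend from a shared pretraining checkpoint, the very assumption at issue. Paths and names are hypothetical, not from the paper.

```python
# Minimal sketch of uniform "model souping" by weight averaging.
# Assumes all checkpoints share the same architecture (identical
# state_dict keys and tensor shapes); checkpoints from different
# pretraining runs may not satisfy this, which is the open question.
import torch

def soup_state_dicts(checkpoint_paths):
    """Average the parameters of several checkpoints uniformly."""
    soup = None
    n = len(checkpoint_paths)
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                soup[k] += v.float()
    return {k: v / n for k, v in soup.items()}

# Hypothetical usage (placeholder file names):
# averaged = soup_state_dicts(["ft_run_a.pt", "ft_run_b.pt", "ft_run_c.pt"])
# model.load_state_dict(averaged)  # model must match the shared architecture
```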
References
We are currently unaware whether souping requires the same pretrained checkpoint or whether it can work with different pretrained checkpoints as well.
— Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
(arXiv:2511.13254, Maiti et al., 17 Nov 2025), Limitations, subsection "General Method Limitations" (application in model training practice)