Does model souping require shared pretraining checkpoints?
Determine whether model souping via weight averaging for large language models requires that the combined checkpoints originate from the same pretraining checkpoint, or whether weight averaging remains viable when the checkpoints come from different pretraining runs.
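Below is a minimal sketch of the souping operation in question: uniform weight averaging over several fine-tuned checkpoints. It assumes PyTorch checkpoints whose state_dicts have identical keys and tensor shapes, which in practice typically means they descend from a shared pretraining checkpoint, the very assumption at issue. Paths and names are hypothetical, not from the paper.

```python
# Minimal sketch of uniform "model souping" by weight averaging.
# Assumes all checkpoints share the same architecture (identical
# state_dict keys and tensor shapes); checkpoints from different
# pretraining runs may not satisfy this, which is the open question.
import torch

def soup_state_dicts(checkpoint_paths):
    """Average the parameters of several checkpoints uniformly."""
    soup = None
    n = len(checkpoint_paths)
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                soup[k] += v.float()
    return {k: v / n for k, v in soup.items()}

# Hypothetical usage (placeholder file names):
# averaged = soup_state_dicts(["ft_run_a.pt", "ft_run_b.pt", "ft_run_c.pt"])
# model.load_state_dict(averaged)  # model must match the shared architecture
```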
References
We are currently unaware whether souping requires the same pretrained checkpoint or whether it can work with different pretrained checkpoints as well.
— Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
(arXiv:2511.13254, Maiti et al., 17 Nov 2025), Limitations, subsection "General Method Limitations" (application in model training practice)