Limits of capability gains from random guessing and ensembling

Determine how far RandOpt can improve performance beyond the pretrained base model on downstream tasks, and characterize whether these gains saturate as model size and perturbation population size increase. RandOpt is a post-training procedure that samples random Gaussian weight perturbations around the pretrained weights, selects the top-performing perturbations by validation score, and ensembles their predictions via majority vote.

Background

The paper shows that in sufficiently large, well-pretrained models, many task-improving parameter perturbations exist in the local neighborhood of the pretrained weights, a phenomenon termed the thicket regime. Leveraging this, a simple post-training procedure (sampling random perturbations and ensembling the best) often matches or exceeds more complex baselines across multiple tasks.
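The sample, select, and vote procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a toy linear classifier in place of a large pretrained model, and the names `linear_predict`, `majority_vote`, and `randopt_sketch`, along with the default values of `n_pop`, `top_k`, and `sigma`, are hypothetical choices made for the example.

```python
import numpy as np

def linear_predict(w, X):
    """Toy stand-in for a model (assumption): binary labels from a linear score."""
    return (X @ w > 0).astype(int)

def majority_vote(votes, n_classes):
    """votes: (n_members, n_examples) integer labels -> (n_examples,) winning labels."""
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)  # ties resolve to the lower class index

def randopt_sketch(base_w, X_val, y_val, X_test,
                   n_pop=64, top_k=8, sigma=0.1, n_classes=2, seed=0):
    """Illustrative sketch: perturb, select by validation score, ensemble by vote."""
    rng = np.random.default_rng(seed)

    # 1. Sample Gaussian perturbations around the "pretrained" weights.
    population = base_w + sigma * rng.standard_normal((n_pop, base_w.size))

    # 2. Score every perturbed weight vector by validation accuracy.
    val_scores = np.array(
        [np.mean(linear_predict(w, X_val) == y_val) for w in population]
    )

    # 3. Keep the top_k perturbations, best first.
    top = np.argsort(val_scores)[::-1][:top_k]

    # 4. Ensemble the selected members' test predictions by majority vote.
    votes = np.stack([linear_predict(population[i], X_test) for i in top])
    return majority_vote(votes, n_classes), val_scores[top]
```

In this sketch, `n_pop` plays the role of the perturbation population size discussed here; sweeping it while tracking held-out accuracy is one way to look for the saturation behavior in question, though a toy model says nothing about how large models behave.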

However, the authors note that although the empirical scaling curves show improvement, they appear to saturate as model size and population size increase. How far random guessing and ensembling can push performance beyond the base model remains undetermined.

References

Our results leave open the question of exactly how far beyond the base model's abilities random guessing and ensembling can take us.

Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights (2603.12228 - Gan et al., 12 Mar 2026) in Limitations, paragraph "Capacity to Learn Dramatically New Skills?"