Evolutionary Optimization of Model Merging Recipes
- The paper introduces an automated evolutionary approach that discovers optimal model merging recipes using gradient-free methods to combine multiple pre-trained models efficiently.
- The methodology involves encoding recipe parameters with simplex weights, sparsity thresholds, and advanced crossover/mutation strategies, while employing surrogate models for rapid fitness evaluation.
- Empirical results demonstrate that evolutionary-tuned recipes yield state-of-the-art performance on cross-domain tasks and scale effectively to high-dimensional, mixed-parameter spaces.
Evolutionary optimization of model merging recipes refers to the automated discovery of parameterized strategies ("recipes") for combining the weights or structural components of multiple pre-trained or fine-tuned models into a single merged model, where recipe parameters are tuned by evolutionary algorithms or other black-box optimization methods. This approach is motivated by the observation that model merging—a procedure to synthesize a new model from a set of parent models without additional gradient-based training—often depends sensitively on recipe hyperparameters, and traditional hand-tuned heuristics are suboptimal or infeasible to scale. Evolutionary optimization frameworks, including population-based metaheuristics such as CMA-ES, genetic algorithms (GA), or NSGA-II, enable efficient, gradient-free search over this high-dimensional, structured, and model-dependent space. Recent advances include formalization of merging spaces (parameter-space and data-flow-space), surrogate benchmarks for rapid recipe evaluation, and new open-source libraries that democratize these techniques for large language and vision models.
1. Formal Problem Statement and Recipe Representations
A model merging recipe broadly defines how to construct a parameter vector (or an inference flow) for the merged model from the weights of multiple source models. In the canonical parameter-space formulation, given $N$ models with weights $\theta_\ell^{(i)}$ for transformer layer $\ell$, a linear merge is $\theta_\ell^{\text{merged}} = \sum_{i=1}^{N} \alpha_i\, \theta_\ell^{(i)}$, with $\alpha = (\alpha_1, \dots, \alpha_N)$ in the $(N-1)$-simplex and optional sparsification or sign-alignment (as in DARE or TIES) (Akiba et al., 19 Mar 2024).
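A minimal sketch of such a parameter-space merge, assuming each source checkpoint is a dictionary of NumPy arrays sharing the same keys; the function and argument names are illustrative, not taken from any cited library:

```python
import numpy as np

def linear_merge(checkpoints, alphas, sparsity=0.0):
    """Merge a list of state dicts by a convex combination of weights.

    checkpoints: list of {param_name: np.ndarray} with identical keys/shapes.
    alphas: simplex weights (non-negative, summing to 1), one per checkpoint.
    sparsity: optional fraction of smallest-magnitude entries zeroed out in
              each source tensor before merging (DARE/TIES-style pruning).
    """
    alphas = np.asarray(alphas, dtype=np.float64)
    assert np.all(alphas >= 0) and abs(alphas.sum() - 1.0) < 1e-6

    merged = {}
    for name in checkpoints[0]:
        acc = np.zeros_like(checkpoints[0][name], dtype=np.float64)
        for a, ckpt in zip(alphas, checkpoints):
            w = ckpt[name]
            if sparsity > 0.0:
                # Keep only the largest-magnitude entries of this source tensor.
                thresh = np.quantile(np.abs(w), sparsity)
                w = np.where(np.abs(w) >= thresh, w, 0.0)
            acc += a * w
        merged[name] = acc
    return merged
```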
More expressive recipes may include:
- Per-layer sparsity thresholds $\tau_\ell$, defining mask functions $m_\ell(\theta) = \mathbb{1}[\,|\theta| \geq \tau_\ell\,]$ that zero out low-magnitude parameters before merging.
- Data-Flow-Space (DFS) recipes: an indicator array $\mathcal{I}$ and a routing/scaling matrix $W$ specifying arbitrary layer-hop paths and the scaling applied between hops for compositional inference through layers across multiple models (Akiba et al., 19 Mar 2024).
For functional merging spaces, such as fusing LoRA adapters, recipes become continuous vectors encoding layer-wise sparsity levels (e.g., $s_\ell \in [0,1]$) and scaling weights (e.g., $\lambda_\ell$) (Chen et al., 16 Sep 2025).
Mixed-type and hierarchical spaces admit both continuous and categorical genes: e.g., selecting the source model per layer, choosing per-block split points, or tuning input-scaling factors (Akizuki et al., 2 Sep 2025).
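The mixed continuous/categorical genotypes just described can be encoded compactly; the following is a hypothetical encoding, with field names chosen for illustration rather than drawn from any of the cited frameworks:

```python
from dataclasses import dataclass
from typing import List
import random

@dataclass
class MergeRecipe:
    """One individual in the evolutionary search (illustrative encoding)."""
    alphas: List[float]        # simplex weights, one per source model
    sparsity: List[float]      # per-layer sparsity thresholds in [0, 1]
    layer_source: List[int]    # categorical gene: which model supplies each layer

    @staticmethod
    def random(n_models: int, n_layers: int) -> "MergeRecipe":
        raw = [random.random() for _ in range(n_models)]
        total = sum(raw)
        return MergeRecipe(
            alphas=[r / total for r in raw],  # project onto the simplex
            sparsity=[random.random() for _ in range(n_layers)],
            layer_source=[random.randrange(n_models) for _ in range(n_layers)],
        )
```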
2. Evolutionary Optimization Algorithms and Search Protocols
Evolutionary optimization of merging recipes proceeds by encoding recipes as individuals (genotypes) and evolving them via population-based, derivative-free methods:
- Initialization: Randomly sample recipe vectors (e.g., coefficients, sparsity rates, or split-points) from feasible domains.
- Mutation: Apply Gaussian perturbation to real-valued genes, e.g., $x' = x + \sigma\,\mathcal{N}(0, I)$ for continuous recipe vectors; for categorical genes, uniform re-sampling or one-of-k swaps are used (see the sketch after this list) (Akizuki et al., 2 Sep 2025).
- Crossover: Implemented as simulated binary crossover (SBX), uniform crossover, or structured operators (e.g., SLERP-based for blockwise splits in M2N2 (Abrantes et al., 22 Aug 2025)).
- Selection: Candidates are evaluated for fitness and selected via tournament selection, Pareto ranking (for multi-objective), or replacement strategies.
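The mutation step referenced above can be sketched as follows, continuing the illustrative MergeRecipe encoding from Section 1; the noise scale and re-sampling probability are arbitrary assumptions, not values from any cited paper:

```python
import copy
import random

def mutate(recipe, sigma=0.05, p_cat=0.1):
    """Gaussian perturbation for real genes, uniform re-sampling for categorical ones."""
    child = copy.deepcopy(recipe)

    # Real-valued genes: additive Gaussian noise, then clip / re-normalize.
    child.alphas = [max(a + random.gauss(0.0, sigma), 1e-6) for a in child.alphas]
    total = sum(child.alphas)
    child.alphas = [a / total for a in child.alphas]  # back onto the simplex
    child.sparsity = [min(max(s + random.gauss(0.0, sigma), 0.0), 1.0)
                      for s in child.sparsity]

    # Categorical genes: one-of-k re-sampling with probability p_cat.
    n_models = len(child.alphas)
    child.layer_source = [random.randrange(n_models) if random.random() < p_cat else g
                          for g in child.layer_source]
    return child
```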
Popular frameworks include:
- CMA-ES: Used for simplex-constrained optimization and high-dimensional mixed recipe search (Khalifa et al., 5 Dec 2024, Chen et al., 16 Sep 2025).
- GA/NSGA-II: Standard GA for single-objective, NSGA-II for multi-objective merging (e.g., accuracy on multiple tasks/languages) (Mencattini et al., 9 Feb 2025).
- Custom evolutionary operators: M2N2 employs dynamic merging boundaries, attraction-based mate selection, and niche-preserving fitness (Abrantes et al., 22 Aug 2025).
A highly generic pseudocode template is:
```python
for generation in range(max_gens):
    population = select(population, fitness)              # parent selection
    offspring = crossover_and_mutate(population)          # variation operators
    evaluate_fitness(offspring)                           # exact or surrogate evaluation
    population = elitist_replacement(population, offspring)
best_recipe = select_best(population)
```
Recipe vector dimensionality typically ranges from a handful of genes (e.g., the $N$ simplex weights) to hundreds (full PS+DFS or blockwise strategies).
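As a concrete, hedged example of the CMA-ES variant, the continuous simplex weights can be searched with the open-source pycma package; `evaluate_fn` is a placeholder for whichever fitness evaluation from Section 3 is in use, and `linear_merge` refers to the sketch in Section 1:

```python
import numpy as np
import cma  # pip install cma (the pycma package)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def search_simplex_weights(checkpoints, evaluate_fn, generations=50):
    """Evolve simplex merging weights with CMA-ES.

    evaluate_fn(merged_state_dict) -> scalar fitness (higher is better);
    the unconstrained genome is mapped onto the simplex via softmax.
    """
    n_models = len(checkpoints)
    es = cma.CMAEvolutionStrategy(np.zeros(n_models), 0.3,
                                  {"maxiter": generations, "verbose": -9})
    while not es.stop():
        genomes = es.ask()
        losses = []
        for g in genomes:
            alphas = softmax(np.asarray(g))
            merged = linear_merge(checkpoints, alphas)  # sketch from Section 1
            losses.append(-evaluate_fn(merged))         # CMA-ES minimizes
        es.tell(genomes, losses)
    return softmax(es.result.xbest)
```

The softmax reparameterization is one simple way to keep the search unconstrained while always producing valid simplex weights; constrained CMA-ES variants are an alternative.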
3. Fitness Functions and Surrogate Evaluation
The core fitness of a merged model is its performance on held-out or validation data, typically the average task metric over $T$ tasks, $F(x) = \frac{1}{T}\sum_{t=1}^{T} m_t(x)$, where $m_t(x)$ is the metric (accuracy, F1, pass@1) of the merged model defined by recipe $x$ on task $t$ (Khalifa et al., 5 Dec 2024).
For black-box API-based merging, fitness is the cross-entropy loss or accuracy evaluated via queries to a Language-Model-as-a-Service endpoint (Chen et al., 16 Sep 2025).
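A minimal sketch of the macro-average fitness above, assuming each task exposes an evaluation callable that returns a scalar metric (the names are illustrative):

```python
def macro_average_fitness(merged_model, task_evals):
    """task_evals: mapping from task name to a callable(model) -> metric.

    Returns the unweighted mean metric across tasks (the F(x) above)
    together with the per-task breakdown for logging.
    """
    scores = {name: evaluate(merged_model) for name, evaluate in task_evals.items()}
    return sum(scores.values()) / len(scores), scores
```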
To reduce computational cost, recent frameworks employ:
- Surrogate models: Learned regressors (e.g., LightGBM) that predict the fitness $\hat{F}(x)$ of a recipe vector $x$; trained on dense samples across recipe space (see the sketch at the end of this subsection) (Akizuki et al., 2 Sep 2025).
- IRT-based estimators: Item Response Theory is used to infer model ability from a small, random validation subset and extrapolate to full-dataset performance via parametric modeling, allowing a roughly 50× reduction in fitness-evaluation cost (Mencattini et al., 9 Feb 2025).
Fitness can also include regularization terms (e.g., a penalty on sparsity coefficients), computational cost, or explicit diversity criteria.
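The surrogate idea can be sketched with an off-the-shelf LightGBM regressor; the training data would be an archive of recipes that were already evaluated exactly, and the hyperparameters below are illustrative assumptions:

```python
import numpy as np
import lightgbm as lgb

def fit_surrogate(evaluated_recipes, measured_fitness):
    """Fit a gradient-boosted regressor mapping recipe vectors to fitness.

    evaluated_recipes: 2-D array, one row per exactly-evaluated recipe.
    measured_fitness:  the corresponding true fitness values.
    """
    model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
    model.fit(np.asarray(evaluated_recipes), np.asarray(measured_fitness))
    return model

def rank_candidates(model, candidate_recipes, top_k=5):
    """Score candidates with the surrogate and return the indices of the
    top_k predicted recipes, which would then be evaluated exactly."""
    preds = model.predict(np.asarray(candidate_recipes))
    return np.argsort(preds)[::-1][:top_k]
```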
4. Multi-objective and Diversity-aware Evolution
Combinatorial merging often poses conflicting objectives (e.g., code accuracy vs. alignment), requiring multi-objective search:
- Scalarization: Macro-average (unweighted) or lexicographic ranking (Khalifa et al., 5 Dec 2024).
- Multi-objective optimization: Explicit optimization of vector-valued objectives (e.g., per-task objectives $f_1(x), \dots, f_k(x)$ in MM-MO), with Pareto dominance and expected hypervolume improvement (qEHVI) guiding recipe selection; a dominance check is sketched after this list (Li et al., 29 Jun 2024).
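A minimal Pareto-dominance check and non-dominated filter, assuming each recipe's objectives are collected into a vector and all objectives are maximized:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximization everywhere)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(objectives):
    """Return indices of recipes whose objective vectors are not dominated."""
    front = []
    for i, fi in enumerate(objectives):
        if not any(dominates(fj, fi) for j, fj in enumerate(objectives) if j != i):
            front.append(i)
    return front
```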
Advanced frameworks introduce diversity maintenance:
- Resource sharing: The fitness of an individual is downweighted when many members of the population excel on the same examples, stabilizing population-level niche coverage (see the sketch after this list) (Abrantes et al., 22 Aug 2025).
- Attraction metrics: Selection of parent models emphasizes complementary capabilities, facilitating efficient exploration of orthogonal skill sets (Abrantes et al., 22 Aug 2025).
- NSGA-II crowding distance: Preserves non-dominated recipes spanning the Pareto front (Mencattini et al., 9 Feb 2025).
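The resource-sharing idea from the first bullet can be illustrated as follows; this is a simplified sketch of fitness sharing over validation examples, not the exact M2N2 formulation:

```python
import numpy as np

def shared_fitness(per_example_scores):
    """per_example_scores: array of shape (population, examples) holding each
    individual's score on each validation example. Each example's 'reward' is
    split across the individuals that score on it, so crowded niches are
    downweighted and under-served examples count for more."""
    scores = np.asarray(per_example_scores, dtype=np.float64)
    column_totals = scores.sum(axis=0) + 1e-12   # total claim on each example
    return (scores / column_totals).sum(axis=1)  # each individual's share
```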
5. Key Empirical Findings and Practical Implementations
Empirical studies across large model pools and multiple tasks demonstrate:
- Effectiveness: Evolutionary merging can produce models with superior Pareto-optimal tradeoffs, outperforming individual source models and hand-tuned heuristics by up to several absolute points in task benchmarks (Khalifa et al., 5 Dec 2024).
- Combinatorial richness: Optimal recipes often utilize nearly all initial checkpoints with nonzero weights, contradicting the intuition that only the “best” or “intuitive” sources contribute (Khalifa et al., 5 Dec 2024).
- Scalability: Efficient evaluation (via IRT-based estimators and reduced validation sets) makes multi-objective evolutionary search practical on consumer GPUs, with full runs completing in roughly 10 hours (Mencattini et al., 9 Feb 2025).
- Cross-domain and cross-lingual composition: Recipes discovered through evolutionary search yield emergent capabilities (e.g., Japanese Math LLMs surpassing 70B baseline models) and enable robust knowledge transfer between linguistically or functionally distinct sources (Akiba et al., 19 Mar 2024).
Representative empirical outcomes include:
| Approach | Key Technique | Speedup/Result |
|---|---|---|
| MERGE (Mencattini et al., 9 Feb 2025) | IRT-based fitness estimation | ~50× faster than naive evaluation |
| M2N2 (Abrantes et al., 22 Aug 2025) | Diversity via resource sharing, attraction | State-of-the-art in LLM/VLM/SDXL |
| Evo-Merging (Chen et al., 16 Sep 2025) | API-based black-box CMA-ES | ~10 F1-point gain vs. LoraHub |
6. Methodological Variants and Open Questions
- Surrogate benchmarks allow low-cost, reproducible comparison of evolutionary optimizers and recipe parameterizations, with public benchmarks facilitating fair methodology evaluation (Akizuki et al., 2 Sep 2025).
- Multiway and blockwise partitioning: Recipes can generalize to multiple split-points (beyond single block boundaries), hierarchically grouping layers or submodules for flexible merging (Abrantes et al., 22 Aug 2025).
- Integration with differentiable merging: Recent studies benchmark evolutionary merging against gradient-based alternatives (e.g., Differentiable Adaptive Merging), finding comparable performance in some cases but higher resource demands for evolutionary approaches (Gauthier-Caron et al., 10 Oct 2024).
- Best practices: Recipe search should ensure diverse sampling, careful surrogate model validation (to avoid extrapolation errors), and selective inclusion of non-intuitive checkpoints to maximize performance and generalization (Akizuki et al., 2 Sep 2025, Khalifa et al., 5 Dec 2024).
- Scalability: Optimization of high-dimensional recipe spaces (hundreds of parameters) remains a challenge; efficiency advances in surrogates and evaluation are critical for scaling to larger model sets (Mencattini et al., 9 Feb 2025).
7. Theoretical Justification and Future Directions
- Flat-minima and generalization: Layer-wise parameter space averaging navigates towards flatter, better-generalizing optima, supporting the empirical superiority of evolutionary-tuned merges (Akiba et al., 19 Mar 2024).
- Distributed knowledge aggregation: DFS routing and block-level mixing exploit the modular storage of knowledge, enabling emergent capabilities not present in any single source (Akiba et al., 19 Mar 2024, Abrantes et al., 22 Aug 2025).
- Optimality guarantees: Surrogate-based and IRT-based evolutionary objective functions possess an $\epsilon$-stability property, ensuring that near-optimal recipes discovered on reduced or approximate fitness landscapes remain near-optimal under the true objective as the validation sample size increases (Mencattini et al., 9 Feb 2025).
- Open research: Directions include adaptive or online surrogate updating, evolving recipes for more than two sources, integration of privacy preservation (Kim et al., 23 Mar 2025), and extension to black-box or API-only environments at ultra-large scale (Chen et al., 16 Sep 2025).
Overall, evolutionary optimization frameworks provide a flexible and powerful methodology for automated discovery of high-performing model merging recipes, with applicability across language, vision, and diffusion models, and with robust empirical evidence for both their versatility and effectiveness.