
PSO-Merging: Optimizing Expert Model Integration

Updated 28 August 2025
  • PSO-Merging is a gradient-free framework using particle swarm optimization and sparsification to efficiently integrate expert neural models for multitask applications.
  • The method constructs a diverse initialization swarm from both expert and sparsified models, ensuring robust search and rapid convergence in under 10 steps.
  • Empirical evaluations show that PSO-Merging achieves high multitask performance with minimal memory requirements compared to traditional gradient-based approaches.

Model merging leverages the strengths of multiple expert models to produce unified multitask systems without prohibitive retraining. PSO-Merging is a data-driven, gradient-free merging framework that utilizes Particle Swarm Optimization (PSO) for scalable, efficient, and high-quality integration of expert models, designed to address the computational inefficiency and weak performance of previous gradient-based and gradient-free approaches for large neural models (Zhang et al., 27 Aug 2025).

1. Motivation and Problem Formulation

The objective of PSO-Merging is to construct multitask models by merging a set of task-specific expert models $\{\theta_1, \ldots, \theta_n\}$, each derived from a common pre-trained base model $\theta_0$, such that the resulting parameters $\theta_\text{merged}$ exhibit competent multitask performance.

Traditional data-independent merging (e.g., arithmetic or fixed-rule methods) lacks data-driven guidance and cannot maximize performance across tasks. Gradient-based data-driven methods, although more adaptive, are computationally expensive for large-scale LLMs (often requiring intractable amounts of GPU memory and training time), especially where repeated backward passes are prohibitive. Previous gradient-free, population-based optimizers such as CMA-ES exhibit slow convergence, wasting evaluations on unpromising candidate solutions.

PSO-Merging seeks a parameter combination $\theta_\text{merged}$ maximizing the mean task score:

$$\theta_\text{merged} = \underset{\theta}{\arg\max} \ \frac{1}{n} \sum_{i=1}^n \text{score}_i(\theta)$$

where $\text{score}_i(\theta)$ measures performance on task $i$ using the merged parameters $\theta$.
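In code, this objective reduces to averaging per-task scores of a single candidate parameter vector. The sketch below assumes a hypothetical list `score_fns` of per-task evaluation callables (e.g., dev-set accuracies computed with forward passes only); it is an illustration, not the authors' implementation.

```python
def fitness(theta, score_fns):
    """Mean score of candidate parameters `theta` across all tasks.

    `score_fns` is a list of callables, one per task, each mapping a
    parameter vector to a scalar score (higher is better).
    """
    return sum(fn(theta) for fn in score_fns) / len(score_fns)
```

This scalar fitness is what PSO maximizes; no gradients of the score functions are ever required.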

2. Swarm Construction and Initialization

PSO-Merging diverges from conventional PSO random initialization by constructing the initial swarm from available expert and sparsified expert models, specifically:

  • The base pre-trained model $\theta_0$
  • All available expert models $\theta_1, \ldots, \theta_n$
  • Sparsified expert models $\{\tilde{\theta}_1, \ldots, \tilde{\theta}_n\}$ created via dropout-style masking of the difference $(\theta_i - \theta_0)$

Sparsification proceeds as follows (for expert $t$ and parameter dimension $i$):

$$m_i^{(t)} \sim \text{Bernoulli}(p)$$

$$\tilde{\theta}_t = \frac{(1 - m^{(t)}) \odot (\theta_t - \theta_0)}{1 - p} + \theta_0$$

where $\odot$ denotes elementwise multiplication and $p$ is the drop rate. This design not only increases population diversity but, by partially masking parameter deltas, also mitigates parameter conflicts between experts.
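The masking rule above can be sketched for flat parameter vectors as follows. This is a minimal NumPy illustration of the Bernoulli-mask-and-rescale step, not the authors' code; the function name and signature are assumptions.

```python
import numpy as np

def sparsify_expert(theta_t, theta_0, p, rng=None):
    """Dropout-style sparsification of an expert's delta (theta_t - theta_0).

    Each delta dimension is dropped with probability p (m_i ~ Bernoulli(p))
    and the surviving entries are rescaled by 1 / (1 - p), so the expected
    delta is preserved.
    """
    rng = rng or np.random.default_rng()
    delta = theta_t - theta_0
    mask = rng.random(delta.shape) < p          # True = dropped dimension
    return theta_0 + (~mask) * delta / (1.0 - p)
```

With `p = 0` the expert is returned unchanged; larger `p` yields sparser, more rescaled deltas, giving the swarm more diverse starting points.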

3. PSO-Merging Optimization Procedure

The population of parameter vectors (particles) $\Theta = \{\theta_j\}$ is evolved over $S$ optimization steps using standard PSO update rules. Each particle maintains:

  • Its current parameter vector $\theta_t^{(i)}$
  • A velocity vector $v_t^{(i)}$
  • Its personal best parameters $\mathbf{pbest}_i$
  • The global best parameters $\mathbf{gbest}$, i.e., the parameters with the best average score across all tasks

Per iteration, for each particle $i$:

  1. Fitness Evaluation: Compute the average multitask score:

$$f(\theta) = \frac{1}{n} \sum_{j=1}^n \text{score}_j(\theta)$$

  2. Velocity Update:

$$v_t^{(i)} = w \cdot v_{t-1}^{(i)} + c_1 \cdot r_1 \cdot (\mathbf{gbest} - \theta_{t-1}^{(i)}) + c_2 \cdot r_2 \cdot (\mathbf{pbest}_i - \theta_{t-1}^{(i)})$$

where $w$ is the momentum coefficient, $c_1$ and $c_2$ weight the global and personal attraction, and $r_1, r_2 \sim U(0,1)$ are sampled per update dimension.

  3. Position Update:

$$\theta_t^{(i)} = \theta_{t-1}^{(i)} + v_t^{(i)}$$

The global best is then re-evaluated across the updated set.

This update is a linear combination of the current candidate, the global best, and the particle's personal best, with an additional momentum contribution whenever $w > 0$.

The process runs for a small number of steps; experiments showed fast convergence, with 5–10 optimization steps typically sufficient. The final merged model is the global best particle after $S$ steps.
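The full loop above can be sketched for flat parameter vectors as follows. This is a generic PSO maximizer following the update rules in this section (momentum, global attraction, personal attraction, per-dimension random factors); the function name, defaults, and flat-vector representation are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pso_merge(particles, fitness, steps=10, w=0.2, c1=1.0, c2=1.0, seed=0):
    """Minimal PSO over flat parameter vectors.

    `particles`: list of 1-D arrays (base model, experts, sparsified experts).
    `fitness`: callable mapping a parameter vector to a scalar score.
    Returns the global best vector and its fitness after `steps` iterations.
    """
    rng = np.random.default_rng(seed)
    pos = [p.astype(float).copy() for p in particles]
    vel = [np.zeros_like(p, dtype=float) for p in pos]
    pbest = [p.copy() for p in pos]
    pbest_f = [fitness(p) for p in pos]
    g = int(np.argmax(pbest_f))
    gbest, gbest_f = pbest[g].copy(), pbest_f[g]

    for _ in range(steps):
        for i in range(len(pos)):
            r1 = rng.random(pos[i].shape)       # per-dimension random factors
            r2 = rng.random(pos[i].shape)
            vel[i] = (w * vel[i]
                      + c1 * r1 * (gbest - pos[i])       # pull toward global best
                      + c2 * r2 * (pbest[i] - pos[i]))   # pull toward personal best
            pos[i] = pos[i] + vel[i]
            f = fitness(pos[i])                 # forward passes only, no gradients
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i].copy(), f
                if f > gbest_f:
                    gbest, gbest_f = pos[i].copy(), f
    return gbest, gbest_f
```

Because the global best is only ever replaced by a strictly better candidate, the returned fitness is monotonically nondecreasing over steps.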

4. Empirical Evaluation

PSO-Merging was tested across several high-profile base architectures and tasks:

  • Flan-T5-Base (8 GLUE tasks)
  • Llama-2-13B, Llama-3-8B, Mistral-7B-v0.3 (Covering instruction-following, mathematics, code, and science QA tasks)

Baseline comparisons involved both data-independent methods (e.g., Task Arithmetic, DARE-Linear, TIES-Merging) and data-driven approaches (e.g., Adamerging, Fisher-Merging, RegMean, CMA-ES-based Evo).

Key outcomes include:

  • For Flan-T5-Base, PSO-Merging consistently matched or exceeded state-of-the-art task arithmetic and gradient-based approaches in average GLUE performance, with strong MNLI accuracy.
  • On LLMs, PSO-Merging achieved consistently higher scores across instruction, math, and code tasks versus prior methods, both when merging three and four experts.
  • Analysis demonstrated that PSO-Merging converges rapidly: particle scores reached high values within the first 5–10 steps.
  • Incorporating sparsified expert models and a moderate momentum coefficient ($w = 0.2$) was found to benefit convergence and solution quality.

In all tested scenarios, PSO-Merging required only inference-mode operations and operated with substantially lower memory overhead than data-driven, gradient-based methods (e.g., 14 GB for three 7B models).

5. Algorithmic Properties and Scalability

Notable characteristics of PSO-Merging include:

  • Data-guided, gradient-free optimization: Requires only forward passes to evaluate fitness functions; this reduces hardware demands and accelerates execution relative to approaches requiring full backward (gradient) passes.
  • Swarm diversity via sparsification: Expanding the swarm size through dropout-style delta sparsification aids in overcoming expert parameter conflicts and encourages robust search.
  • Rapid convergence: Convergence is typically achieved in very few optimization steps (empirically <10 in most settings), making PSO-Merging practical for large models and high-throughput scenarios.
  • Scalability: Demonstrates strong merging results as the number of experts increases and across large models, without degradation or excessive computational demand.

6. Comparison with Baselines and Existing Approaches

| Method | Data-Driven | Gradient-Free | Fast Convergence | High Memory Demand | Pre-Merge Diversity |
|---|---|---|---|---|---|
| Task Arithmetic | No | Yes | Yes | No | Low |
| Adamerging | Yes | No | No (slow) | Yes | Low |
| CMA-ES (Evo) | Yes | Yes | No (slow) | Yes | Moderate |
| PSO-Merging | Yes | Yes | Yes | No | High (via sparsification) |

PSO-Merging occupies an advantageous position by delivering both data-driven fitness and computational tractability, in contrast to approaches that either lack specific guidance or are impractically resource-intensive.

7. Limitations and Future Directions

A limitation acknowledged by the authors is that PSO-Merging, as designed, assumes all experts share the same pre-trained base architecture. Extending to merging expert models originating from different base architectures remains an open direction for research. The method also assumes the availability of suitable evaluation sets for data-driven scoring.

Possible future work includes:

  • Adapting PSO-Merging for heterogeneous expert bases (architecturally diverse models)
  • Exploration of advanced fitness functions for nuanced multitask trade-offs
  • Hybridization with local fine-tuning or structure-aware regularization post-merge

Summary

PSO-Merging is a practical, data-driven, and efficient algorithmic framework for large-scale model merging, using particle swarm optimization to balance exploration and exploitation, incorporating expert and sparsified-expert initializations, and converging rapidly with minimal hardware overhead (Zhang et al., 27 Aug 2025). This approach delivers merged multitask models with state-of-the-art performance across a diverse range of tasks and model sizes, overcoming the key computational and optimization obstacles encountered by previous merging methods. The conceptual and practical advances it illustrates provide a scalable foundation for future research in robust model composition.
