PSO-Merging: Optimizing Expert Model Integration
- PSO-Merging is a gradient-free framework using particle swarm optimization and sparsification to efficiently integrate expert neural models for multitask applications.
- The method constructs a diverse initialization swarm from both expert and sparsified expert models, ensuring robust search and rapid convergence, typically in under 10 optimization steps.
- Empirical evaluations show that PSO-Merging achieves high multitask performance with minimal memory requirements compared to traditional gradient-based approaches.
Model merging leverages the strengths of multiple expert models to produce unified multitask systems without prohibitive retraining. PSO-Merging is a data-driven, gradient-free merging framework that utilizes Particle Swarm Optimization (PSO) for scalable, efficient, and high-quality integration of expert models, designed to address the computational inefficiency and weak performance of previous gradient-based and gradient-free approaches for large neural models (Zhang et al., 27 Aug 2025).
1. Motivation and Problem Formulation
The objective of PSO-Merging is to construct a multitask model by merging a set of task-specific expert models $\theta_1, \dots, \theta_N$, each derived from a common pre-trained base model $\theta_0$, such that the resulting parameters exhibit competent performance across all tasks.
Traditional data-independent merging (e.g., arithmetic or fixed-rule methods) lacks data-driven guidance and cannot maximize performance across tasks. Gradient-based data-driven methods, although more adaptive, are computationally expensive for large-scale LLMs (often requiring intractable amounts of GPU memory and training time), especially where repeated backward passes are prohibitive. Previous gradient-free, population-based optimizers such as CMA-ES exhibit slow convergence, wasting evaluations on unpromising candidate solutions.
PSO-Merging seeks to find a parameter combination maximizing the mean task score:

$$\theta^{*} = \arg\max_{\theta} \; \frac{1}{T} \sum_{t=1}^{T} S_t(\theta),$$

where $S_t(\theta)$ measures performance on task $t$ using the merged parameters $\theta$.
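The fitness criterion above (a mean over per-task scores) can be sketched as follows; the `tasks` list of scoring callables is a hypothetical stand-in for the paper's per-task evaluation (e.g., dev-set accuracy):

```python
import numpy as np

def fitness(theta, tasks):
    """Average score of parameter vector `theta` across all tasks.

    `tasks` is a list of callables, each mapping a parameter vector to a
    scalar task score (illustrative evaluation hooks, not the paper's API).
    """
    return float(np.mean([score(theta) for score in tasks]))
```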
2. Swarm Construction and Initialization
PSO-Merging diverges from conventional PSO random initialization by constructing the initial swarm from available expert and sparsified expert models, specifically:
- The base pre-trained model $\theta_0$
- All available expert models $\theta_1, \dots, \theta_N$
- Sparsified expert models created via dropout on the parameter difference $\Delta_i = \theta_i - \theta_0$
Sparsification proceeds as follows (for expert $i$ and parameter dimension $d$):

$$\tilde{\theta}_i = \theta_0 + m_i \odot (\theta_i - \theta_0), \qquad m_{i,d} \sim \mathrm{Bernoulli}(1 - p),$$

where $\odot$ denotes elementwise multiplication and $p$ is the drop rate. This design not only increases population diversity but, by partially masking parameter deltas, also mitigates parameter conflicts between experts.
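A minimal sketch of this delta sparsification, assuming expert and base parameters are flattened NumPy vectors (the function name and signature are illustrative):

```python
import numpy as np

def sparsify_expert(theta_base, theta_expert, drop_rate, rng=None):
    """Sparsify an expert by randomly dropping entries of its delta.

    Each coordinate of the task vector (theta_expert - theta_base) is
    zeroed with probability `drop_rate`; the surviving entries are added
    back onto the base parameters.
    """
    rng = rng or np.random.default_rng()
    delta = theta_expert - theta_base
    mask = rng.random(delta.shape) >= drop_rate  # keep with prob 1 - p
    return theta_base + mask * delta
```

With `drop_rate=0` the expert is recovered unchanged; with `drop_rate=1` the result collapses back to the base model.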
3. PSO-Merging Optimization Procedure
The population of parameter vectors (particles) is evolved over $K$ optimization steps using standard PSO update rules. Each particle $i$ maintains:
- Its current parameter vector $x_i$
- A velocity vector $v_i$
- Personal best parameters $p_i$
- The global best parameters $g$, i.e., the parameters with the best average score across all tasks
Per iteration, for each particle $i$:
- Fitness Evaluation: Compute the average multitask score $\frac{1}{T} \sum_{t=1}^{T} S_t(x_i)$.
- Velocity Update:

$$v_i \leftarrow \omega v_i + c_1 r_1 \odot (g - x_i) + c_2 r_2 \odot (p_i - x_i),$$

where $\omega$ is the momentum coefficient, $c_1$ and $c_2$ are weights for global and personal attraction, and $r_1, r_2 \sim \mathcal{U}(0, 1)$ are sampled per update dimension.
- Position Update:

$$x_i \leftarrow x_i + v_i.$$
The global best is then re-evaluated across the updated set.
This update embodies a linear combination of the existing candidate, the global best, and the personal trajectory history, with an additional momentum contribution when $\omega > 0$.
The process continues for a small number of steps; experiments showed fast convergence, with 5–10 optimization steps typically sufficing. The final merged model is the global-best particle after $K$ steps.
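The full procedure can be sketched as a compact global-best PSO loop over flattened parameter vectors. This is a minimal illustration, not the paper's implementation: hyperparameter names and values (`omega`, `c1`, `c2`, `steps`) are placeholders, and `fitness` is any callable returning the average task score. Following the text, `c1` weights attraction toward the global best and `c2` toward the personal best.

```python
import numpy as np

def pso_merge(particles, fitness, steps=10, omega=0.2, c1=1.0, c2=1.0, seed=0):
    """Minimal global-best PSO over flattened parameter vectors.

    `particles`: (P, D) array-like of initial candidates (base model,
    experts, and sparsified experts). `fitness`: callable mapping a
    D-vector to a scalar average task score (higher is better).
    Returns the global-best parameter vector after `steps` iterations.
    """
    rng = np.random.default_rng(seed)
    x = np.array(particles, dtype=float)
    v = np.zeros_like(x)
    pbest = x.copy()                                  # personal bests
    pbest_score = np.array([fitness(p) for p in x])
    g = pbest[np.argmax(pbest_score)].copy()          # global best
    g_score = pbest_score.max()
    for _ in range(steps):
        # Per-dimension random coefficients for global/personal attraction
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = omega * v + c1 * r1 * (g - x) + c2 * r2 * (pbest - x)
        x = x + v
        scores = np.array([fitness(p) for p in x])
        improved = scores > pbest_score
        pbest[improved] = x[improved]
        pbest_score[improved] = scores[improved]
        if scores.max() > g_score:                    # re-evaluate global best
            g_score = scores.max()
            g = x[np.argmax(scores)].copy()
    return g
```

Because the global best is only ever replaced by a strictly better candidate, the returned vector is never worse than the best initial particle.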
4. Empirical Evaluation
PSO-Merging was tested across several high-profile base architectures and tasks:
- Flan-T5-Base (8 GLUE tasks)
- Llama-2-13B, Llama-3-8B, Mistral-7B-v0.3 (covering instruction-following, mathematics, code, and science QA tasks)
Baseline comparisons involved both data-independent methods (e.g., Task Arithmetic, DARE-Linear, TIES-Merging) and data-driven approaches (e.g., Adamerging, Fisher-Merging, RegMean, CMA-ES-based Evo).
Key outcomes include:
- For Flan-T5-Base, PSO-Merging consistently matched or exceeded state-of-the-art task arithmetic and gradient-based approaches in average GLUE performance, with strong MNLI accuracy.
- On LLMs, PSO-Merging achieved consistently higher scores across instruction, math, and code tasks versus prior methods, both when merging three and four experts.
- Analysis demonstrated that PSO-Merging converges rapidly: particle scores reached high values within the first 5–10 steps.
- Incorporating sparsified expert models and using a moderate momentum coefficient were both found to benefit convergence and solution quality.
In all tested scenarios, PSO-Merging required only inference-mode operations and operated with substantially lower memory overhead than data-driven, gradient-based methods (e.g., 14 GB for three 7B models).
5. Algorithmic Properties and Scalability
Notable characteristics of PSO-Merging include:
- Data-guided, gradient-free optimization: Requires only forward passes to evaluate fitness functions; this reduces hardware demand and accelerates execution relative to backward-pass (gradient-based) approaches.
- Swarm diversity via sparsification: Expanding the swarm size through dropout-style delta sparsification aids in overcoming expert parameter conflicts and encourages robust search.
- Rapid convergence: Convergence is typically achieved in very few optimization steps (empirically <10 in most settings), making PSO-Merging practical for large models and high-throughput scenarios.
- Scalability: Demonstrates strong merging results as the number of experts increases and across large models, without degradation or excessive computational demand.
6. Comparison with Baselines and Existing Approaches
| Method | Data-Driven | Gradient-Free | Fast Convergence | High Memory Demand | Pre-Merge Diversity |
|---|---|---|---|---|---|
| Task Arithmetic | No | Yes | Yes | No | Low |
| Adamerging | Yes | No | No (slow) | Yes | Low |
| CMA-ES (Evo) | Yes | Yes | No (slow) | Yes | Moderate |
| PSO-Merging | Yes | Yes | Yes | No | High (via sparsification) |
PSO-Merging occupies an advantageous position by delivering both data-driven fitness and computational tractability, in contrast to approaches that either lack specific guidance or are impractically resource-intensive.
7. Limitations and Future Directions
A limitation acknowledged by the authors is that PSO-Merging, as designed, assumes all experts share the same pre-trained base architecture. Extending to merging expert models originating from different base architectures remains an open direction for research. The method also assumes the availability of suitable evaluation sets for data-driven scoring.
Possible future work includes:
- Adapting PSO-Merging for heterogeneous expert bases (architecturally diverse models)
- Exploration of advanced fitness functions for nuanced multitask trade-offs
- Hybridization with local fine-tuning or structure-aware regularization post-merge
Summary
PSO-Merging is a practical, data-driven, and efficient algorithmic framework for large-scale model merging, using particle swarm optimization to balance exploration and exploitation, incorporating expert and sparsified expert initializations, and converging rapidly with minimal hardware overhead (Zhang et al., 27 Aug 2025). This approach delivers merged multitask models with state-of-the-art performance across a diverse range of tasks and model sizes, overcoming the key computational and optimization obstacles encountered by previous merging methods. The conceptual and practical advances it illustrates provide a scalable foundation for future research in robust model composition.