
PSO-Merging: Merging Models Based on Particle Swarm Optimization (2508.19839v1)

Published 27 Aug 2025 in cs.LG and cs.AI

Abstract: Model merging has emerged as an efficient strategy for constructing multitask models by integrating the strengths of multiple available expert models, thereby reducing the need to fine-tune a pre-trained model for all the tasks from scratch. Existing data-independent methods struggle with performance limitations due to the lack of data-driven guidance. Data-driven approaches also face key challenges: gradient-based methods are computationally expensive, limiting their practicality for merging large expert models, whereas existing gradient-free methods often fail to achieve satisfactory results within a limited number of optimization steps. To address these limitations, this paper introduces PSO-Merging, a novel data-driven merging method based on Particle Swarm Optimization (PSO). In this approach, we initialize the particle swarm with a pre-trained model, expert models, and sparsified expert models. We then perform multiple iterations, with the final global best particle serving as the merged model. Experimental results on different LLMs show that PSO-Merging generally outperforms baseline merging methods, offering a more efficient and scalable solution for model merging.

Summary

  • The paper introduces a novel gradient-free approach using PSO to merge fine-tuned models into a single multitask LLM.
  • It initializes the swarm with the pre-trained model, the fine-tuned experts, and sparsified experts, using iterative velocity updates to balance exploration and exploitation efficiently.
  • Experimental results demonstrate superior multitask scores, rapid convergence, and significant memory efficiency over gradient-based methods.

PSO-Merging: Merging Models Based on Particle Swarm Optimization

Introduction

PSO-Merging introduces a data-driven, gradient-free approach for merging multiple fine-tuned LLM experts into a single multitask model, leveraging Particle Swarm Optimization (PSO) in the parameter space. The method addresses the limitations of both data-independent merging (which lacks task-specific adaptation) and gradient-based data-driven approaches (which are computationally prohibitive for large models). By initializing the PSO swarm with pre-trained, fine-tuned, and sparsified expert models, PSO-Merging efficiently searches for an optimal parameter combination that maximizes multitask performance.

Figure 1: Overview of PSO-Merging, showing initialization with pre-trained, fine-tuned, and sparsified experts, and iterative update cycles in parameter space.

Methodology

Problem Formulation

Given a set of tasks $T = \{\tau_1, \ldots, \tau_n\}$ and corresponding fine-tuned experts $\{\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n\}$ derived from a base model $\boldsymbol{\theta}_0$, the objective is to merge these experts into a single model $\boldsymbol{\theta}_{\mathrm{merged}}$ that performs well across all tasks.
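
Since fitness is later defined as the average score across tasks (see Iterative Updates), the objective can be stated compactly. This formalization is a convenience, consistent with but not verbatim from the paper; $D_{\tau_k}$ denotes the small optimization set for task $\tau_k$:

$$\boldsymbol{\theta}_{\mathrm{merged}} = \arg\max_{\boldsymbol{\theta}} \; \frac{1}{n} \sum_{k=1}^{n} \mathrm{score}_{\tau_k}\!\left(\boldsymbol{\theta};\, D_{\tau_k}\right)$$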

PSO-Merging Algorithm

Initialization

  • The swarm is initialized with:
    • The pre-trained base model
    • All fine-tuned expert models
    • Sparsified versions of each expert, obtained via Bernoulli masking of parameter deltas (a minimal sketch follows this list)
  • Sparsification mitigates parameter conflicts and increases the diversity of initial solutions.
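
A minimal sketch of the Bernoulli sparsification step, assuming DARE-style masking with rescaling of parameter deltas (the analysis later notes a drop rate of $p = 0.8$); the flattened-tensor representation and function name are illustrative, not from the paper's code:

```python
import torch

def sparsify_expert(theta_expert: torch.Tensor, theta_base: torch.Tensor,
                    drop_rate: float = 0.8) -> torch.Tensor:
    """DARE-style sparsification: randomly drop a fraction of the parameter
    delta (expert minus base) and rescale the surviving entries."""
    delta = theta_expert - theta_base
    # Bernoulli mask: each delta entry survives with probability 1 - drop_rate.
    keep = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
    # Rescale survivors by 1 / (1 - drop_rate) so the expected delta is unchanged.
    return theta_base + delta * keep / (1.0 - drop_rate)
```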

Iterative Updates

  • Each particle (model) is evaluated on a small optimization set, with fitness defined as the average score across all tasks.
  • Particle velocities are updated using the canonical PSO formula:

$$\boldsymbol{v}_t^{(i)} = w \cdot \boldsymbol{v}_t^{(i-1)} + c_1 r_1 \left(\boldsymbol{\theta}_{\mathrm{gbest}}^{(i-1)} - \boldsymbol{\theta}_t^{(i-1)}\right) + c_2 r_2 \left(\boldsymbol{\theta}_{t,\mathrm{pbest}}^{(i-1)} - \boldsymbol{\theta}_t^{(i-1)}\right)$$

  • Positions are updated as:

$$\boldsymbol{\theta}_t^{(i)} = \boldsymbol{\theta}_t^{(i-1)} + \boldsymbol{v}_t^{(i)}$$

  • After $S$ steps, the global best particle is selected as the merged model (the full update loop is sketched below).
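
The two update rules translate directly into a short loop. The sketch below is illustrative rather than the authors' implementation: particles are flattened parameter vectors, velocities start at zero (an assumption; the summary does not specify velocity initialization), $r_1, r_2$ are drawn as scalars per update, and the $c_1 = c_2 = 1.0$ defaults are placeholders:

```python
import random
import torch

def pso_merge(particles, fitness_fn, steps=10, w=0.2, c1=1.0, c2=1.0):
    """Illustrative PSO merging loop. `particles` holds flattened parameter
    vectors (base model, experts, sparsified experts); `fitness_fn` maps a
    vector to its average score across tasks on the small optimization set."""
    velocities = [torch.zeros_like(p) for p in particles]  # assumed zero init
    pbest = [p.clone() for p in particles]
    pbest_score = [fitness_fn(p) for p in particles]
    g = max(range(len(particles)), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g].clone(), pbest_score[g]

    for _ in range(steps):
        for t in range(len(particles)):
            r1, r2 = random.random(), random.random()
            # Velocity: momentum plus attraction to global and personal bests.
            velocities[t] = (w * velocities[t]
                             + c1 * r1 * (gbest - particles[t])
                             + c2 * r2 * (pbest[t] - particles[t]))
            particles[t] = particles[t] + velocities[t]
            score = fitness_fn(particles[t])
            if score > pbest_score[t]:  # update personal and global bests
                pbest[t], pbest_score[t] = particles[t].clone(), score
                if score > gbest_score:
                    gbest, gbest_score = particles[t].clone(), score
    return gbest  # the final global best particle is the merged model
```

Whether bests are refreshed within the sweep (as here) or only after each full iteration is a standard PSO design choice that the summary does not pin down.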

Intuitive Justification

  • The update rule is a data-guided linear combination of the current, personal best, and global best solutions, with momentum for exploration.
  • Sparsification in the initial swarm enables the search to avoid regions of high parameter conflict.

Experimental Results

PSO-Merging was evaluated on Flan-T5-Base, Llama-2-13B, Llama-3-8B, and Mistral-7B-v0.3, merging up to four experts per base model. The method was compared against a comprehensive set of baselines, including Task Arithmetic, DARE-Linear, TIES-Merging, DELLA-Merging, RankMean, Evo (CMA-ES), Adamerging, Fisher-Merging, and RegMean.

  • On Flan-T5-Base, PSO-Merging achieved the highest average multitask score (81.24), outperforming all baselines, with a notable improvement on MNLI.
  • On Llama-2-13B, Llama-3-8B, and Mistral-7B-v0.3, PSO-Merging consistently yielded the best average scores, with substantial gains in instruction-following and mathematical reasoning tasks.

    Figure 2: Score trajectories for all particles under different momentum coefficients $w$, illustrating rapid convergence and the effect of initialization.

  • Increasing the number of particles (via sparsification and inclusion of the base model) improved final performance.
  • PSO-Merging demonstrated rapid convergence, typically within 5–10 steps, and was robust to the choice of $w$ (optimal at $w = 0.2$).

Analysis

Hyperparameter Sensitivity

  • The momentum coefficient $w$ is critical: $w = 0.2$ balances exploration and exploitation, enabling all particles to converge to high scores.
  • Larger particle swarms (enabled by sparsification) facilitate better exploration and higher final scores.

Efficiency

  • PSO-Merging is significantly more memory-efficient than gradient-based methods (Adamerging, Fisher-Merging, RegMean), requiring only inference-time memory for each particle.
  • For three 7B models, PSO-Merging requires 14 GB, compared to 28–42 GB for gradient-based approaches.
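
As a rough sanity check (back-of-the-envelope arithmetic, not from the paper): a 7B-parameter model in 16-bit precision occupies about $7 \times 10^9 \times 2$ bytes ≈ 14 GB, so the reported footprint corresponds to holding roughly one particle's weights in memory at a time, while gradient-based methods must additionally store gradients and optimizer state, consistent with the 28–42 GB range.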

Scalability

  • The method scales to merging four or more experts, maintaining superior performance and convergence speed.

Practical Implications

PSO-Merging provides a scalable, efficient solution for constructing multitask LLMs from existing fine-tuned experts, without the need for full retraining or expensive gradient computations. The approach is particularly suited for scenarios with limited task-specific data and large model sizes, where traditional gradient-based merging is infeasible.

  • The method is applicable to any scenario where multiple experts on the same base architecture are available.
  • The sparsification mechanism is essential for mitigating parameter conflicts and should be incorporated in practical deployments.
  • The optimization set can be small, enabling rapid prototyping and deployment.

Limitations and Future Directions

  • The current method assumes all experts are derived from the same base model. Extending PSO-Merging to heterogeneous architectures remains an open challenge.
  • Further research is needed to adapt the approach for cross-architecture or cross-family model merging.

Conclusion

PSO-Merging leverages swarm intelligence and data-driven fitness evaluation to efficiently merge multiple LLM experts into a single multitask model. The method achieves superior multitask performance, rapid convergence, and high memory efficiency compared to existing baselines. Its practical utility is evident in scenarios requiring the consolidation of diverse expert capabilities without retraining. Future work should explore its extension to heterogeneous model architectures and broader application domains.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Knowledge Gaps, Limitations, and Open Questions

  • Generality across bases/architectures: No evaluation of merging experts trained on different base checkpoints or architectures (e.g., cross-family merges like Mistral + Llama, or dense + MoE). How to align incompatible parameter spaces?
  • Scalability to many experts: Only up to four experts are merged; no evidence or analysis for merging dozens/hundreds of experts commonly found in real-world model zoos. What are compute/memory and performance scaling laws w.r.t. number of experts?
  • Sensitivity to optimization data: Fitness uses very small labeled sets (e.g., 50 samples per GLUE task, 1:10 splits elsewhere). How does performance vary with data size, noise, domain shift, and label scarcity?
  • Label-free setting: The method assumes labeled fitness for accuracy/pass@1/judged win rate. How to define robust, effective unsupervised or weakly supervised fitness functions when labels are unavailable?
  • Overfitting risks: Limited exploration of overfitting to tiny optimization sets. How does the merged model generalize under distribution shift and to unseen tasks?
  • Variance across seeds: PSO and sparsification are stochastic, but no multi-seed runs or confidence intervals are reported. What is the variance in outcomes and success probability?
  • Hyperparameter sensitivity: Only the inertia term $w$ is explored. No systematic ablation for $c_1$, $c_2$, number of steps $S$, sparsification drop rate $p$, or particle count. What are robust default settings and interactions?
  • Task weighting and trade-offs: Fitness is an unweighted average across tasks. How do different task weights or multi-objective formulations (e.g., Pareto optimization) affect regressions and trade-offs?
  • Constraint handling: No mechanisms to constrain updates (e.g., velocity clipping, norm bounds, convex-hull constraints). Do constraints improve stability, prevent out-of-manifold drifts, or reduce regressions?
  • Per-layer/per-module search: PSO operates on full parameter vectors; no exploration of layer-wise, block-wise, or module-wise PSO (e.g., attention vs MLP), which may reduce conflicts and improve control.
  • Convergence theory and schedules: No theoretical guarantees or empirical study of PSO inertia/acceleration scheduling (e.g., decaying $w$) commonly used in PSO to balance exploration/exploitation.
  • Memory/time accounting: Reported memory numbers lack detail on storing multiple particles and velocities for 7B–13B models. What are precise wall-clock, FLOPs, and peak memory vs baselines as particle count grows?
  • Implementation practicality: Storing and updating velocity tensors for billions of parameters across multiple particles is nontrivial. Are memory mapping, sharded storage, or on-the-fly recombination required, and how do they affect speed/accuracy?
  • Sparsification design: Only DARE-style random sparsification with $p = 0.8$ is used. How do alternative sparsifiers (magnitude, Fisher, structured, per-layer) and different $p$ values impact conflicts and accuracy?
  • Initial particle composition: Limited analysis of how many and which particles (experts, sparsified experts, base) are most beneficial. What are diminishing returns and optimal portfolio construction strategies?
  • Comparison breadth: No comparison against Model Swarms or other recent iterative/data-driven mergers beyond CMA-ES; gradient-based baselines are omitted for larger models. How does PSO-Merging fare under equalized resource budgets and tuned baselines?
  • LoRA/adapter space merging: Only full-weight merging is studied. Can PSO search over LoRA deltas or low-rank subspaces for better efficiency, controllability, and memory use?
  • Expert quality and conflict handling: No diagnostics for identifying harmful experts or task conflicts prior to merging. Can PSO be augmented with expert selection, gating, or penalization for conflict-prone experts?
  • Domain/task coverage: Benchmarks are limited (GLUE, GSM8K, MBPP, AlpacaEval, SciQ). No evaluation on multilingual, long-context, code reasoning beyond MBPP, retrieval-augmented tasks, or safety/toxicity/factuality.
  • Negative side effects: No analysis of catastrophic forgetting of base capabilities, calibration, hallucination rates, or robustness (adversarial/noisy inputs) after merging.
  • Alignment and safety preservation: Merging may degrade RLHF/DPO alignment properties. How to preserve refusal behavior, safety policies, and preference consistency during PSO updates?
  • Evaluation noise and judge bias: AlpacaEval uses an LLM judge (Llama-3.1-70B) with potential bias/variance; no agreement checks with human raters or multiple judges. How robust are wins to judge choice and prompt variations?
  • Statistical significance: No confidence intervals or statistical tests; single runs per configuration. Are observed gains statistically reliable across seeds and re-trains of experts?
  • Quantization compatibility: No results for post-merge 8-bit/4-bit quantization or QAT. Does PSO-Merging preserve quantization performance and calibration?
  • Licensing and provenance: Merging experts with heterogeneous licenses or data provenance may raise legal/ethical concerns. What are permissible combinations and compliance checks?
  • Incremental/continual merging: How to add new experts later without revisiting earlier data or re-running from scratch? Can PSO support warm-started or streaming merging regimes?
  • Runtime scaling with experts: No empirical curves of runtime/memory vs number of experts and particles. What are practical limits on commodity hardware?
  • Alternative fitness objectives: Beyond task scores, can validation loss/perplexity, ensemble agreement, uncertainty proxies, or contrastive objectives serve as reliable, label-free fitness signals?
  • Regularization toward base/expert: No use of regularizers (e.g., Fisher/Elastic Weight Consolidation) during PSO. Do such constraints mitigate regressions and stabilize updates?
  • Geometry of weight space: No analysis linking improvements to linear-mode connectivity or low-dimensional subspaces. Can PSO operate in learned subspaces (PCA, task subspaces) to improve sample efficiency?
  • Interpretability/probing: No study of how merging alters representations or neuron/module functions. Can probes explain task transfer or interference patterns induced by PSO?
  • Reproducibility details: Exact seeds, hardware, dataset partitions, and trained expert checkpoints for Llama-3-8B/Mistral are not fully specified/released. Can others replicate results end-to-end?