
One-Shot Permutation-Based Merging

Updated 10 December 2025
  • The paper presents a one-shot merging approach that resolves hidden-unit permutation symmetries via assignment algorithms, enabling immediate and functionally meaningful weight aggregation.
  • It details a pipeline that includes local representation extraction, sign/scaling alignment, permutation resolution using Hungarian matching or k-means clustering, followed by closed-form weight merging.
  • Empirical results show improved federated learning, ensembling, and multi-task performance with robust error bounds and zero-barrier mode connectivity in sufficiently wide neural architectures.

One-shot permutation-based merging refers to a class of model aggregation strategies that resolve hidden-unit permutation symmetries in neural networks (or similar latent-variable models) in a single, non-iterative step, enabling immediate weight merging of models trained independently. This methodology fundamentally addresses the inherent non-identifiability of permutable latent dimensions in architectures such as MLPs, CNNs, Transformers, and factor analysis models (e.g., ICA), which precludes naive averaging due to arbitrary ordering of hidden components. In the one-shot setting, optimal or near-optimal alignment of model subspaces or parameter columns is solved directly—most commonly via assignment or clustering algorithms—followed by closed-form merging and, if needed, robust or per-task corrective postprocessing. This paradigm has proven critical for federated learning, large-scale ensembling, continual learning, and multi-task systems, offering scalable, communication-efficient, and robust aggregation solutions.

1. Formal Problem Definition and Symmetry Structure

Given a set of independently trained models $\{\theta_i\}$, each with weight matrices $W^{(i)}_\ell$ at layer $\ell$, the central challenge is that for many architectures, particularly those with exchangeable hidden units, the parameter space exhibits permutation symmetry. For models A and B, let $P$ denote a layer-appropriate permutation matrix; if $f(W)$ is the model computation, then $f(W) = f(PWP^T)$, yet $W$ and $PWP^T$ are parameter-wise distinct. Naive averaging of unaligned weights destroys this structural equivalence, leading to suboptimal or even catastrophic performance upon merging. One-shot permutation-based merging solves, for each symmetry-susceptible layer, an assignment problem to find $P^*$ such that the permuted parameters $P^* W^{(B)}_\ell P^{*T}$ are maximally aligned with $W^{(A)}_\ell$, thereby enabling functionally meaningful averaging or aggregation (Ainsworth et al., 2022, Verma et al., 1 Mar 2024, Sharma et al., 16 Oct 2024).
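To make the symmetry concrete, the following minimal NumPy sketch (illustrative only, not code from the cited papers) builds a toy two-layer MLP, permutes its hidden units, and checks that the function is unchanged while the parameters are not:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3

# A toy two-layer MLP: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

def mlp(x, W1, W2):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Permute the hidden units: rows of W1 and the matching columns of W2.
perm = rng.permutation(d_hidden)
P = np.eye(d_hidden)[perm]            # permutation matrix
W1_p, W2_p = P @ W1, W2 @ P.T         # parameter-wise distinct weights

x = rng.normal(size=d_in)
print(np.allclose(mlp(x, W1, W2), mlp(x, W1_p, W2_p)))   # True: same function
print(np.allclose(0.5 * (W1 + W1_p), W1))                # False: naive averaging mixes unrelated units
```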

2. Core Algorithmic Pipeline

The generic one-shot permutation-based merging workflow decomposes as follows:

  1. Local Representation Extraction: For each model or client, extract parameters or component estimates. In distributed ICA, each participant runs local ICA to obtain estimates $A^{(k)}$ or, in neural nets, collects layer weights or feature activations (Jin et al., 26 May 2025, Ainsworth et al., 2022, Verma et al., 1 Mar 2024).
  2. Sign and Scaling Alignment (if needed): Normalize components to a standard sign convention (e.g., using reference dot-products in ICA) (Jin et al., 26 May 2025).
  3. Permutation Alignment:

    • Linear Assignment (Hungarian Matching): For each permutation-symmetric layer, solve the assignment

    $$P^* = \arg\min_{P \in S_d} \|W^{(A)} - P W^{(B)}\|_F^2$$

    via the Hungarian algorithm or equivalent, possibly using data-agnostic weights (row-wise) or data-dependent feature correlations (Ainsworth et al., 2022, Sharma et al., 16 Oct 2024, Verma et al., 1 Mar 2024); a minimal code sketch of this step and the subsequent merge follows this list.
    • k-means Clustering: For ICA or distributed settings, vector columns from all models are pooled and clustered using $k$-means, with the cluster assignment resolving cross-client permutations (Jin et al., 26 May 2025).
    • Task/Component-Specific Interventions: For complex architectures (e.g., Transformers), permutations may be decomposed block-wise (e.g., per attention head), using head-level and intra-head linear assignments based on feature correlations, and cascaded through skip/residual paths (Verma et al., 1 Mar 2024).

  4. Aggregate or Merge:

    • Averaging: Compute merged weights as the average of aligned weights,

    $$W^{(M)}_\ell = \tfrac{1}{2}\left(W^{(A)}_\ell + P_\ell W^{(B)}_\ell P_\ell^T\right)$$

    or, when merging a base model with $K$ permuted experts,

    $$W_{\mathrm{merge}} = \frac{1}{K+1}\left(W^0 + \sum_{i=1}^{K} \widetilde{W}^i\right)$$

    where $\widetilde{W}^i$ are permuted expert weights (Sharma et al., 16 Oct 2024).
    • Robust Aggregation: In federated or heterogeneous settings, further aggregate within clusters using robust estimators such as the geometric median (Jin et al., 26 May 2025).

  5. Optional Postprocessing: Address global scaling, affine misalignment, or variance collapse via per-task, per-layer rescaling and shifting (TACT), particularly in multi-task or non-local merge scenarios (Sharma et al., 16 Oct 2024).
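As a concrete illustration of steps 3–4, the sketch below performs data-agnostic Hungarian weight matching and closed-form averaging for a single hidden layer of two two-layer MLPs. It is a minimal sketch under stated assumptions; the helper name `match_and_merge` and the particular similarity cost are illustrative, not taken from the cited papers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_merge(W1_a, W2_a, W1_b, W2_b):
    """Align model B's hidden units to model A's, then average (one hidden layer)."""
    # Similarity of assigning hidden unit j of B to unit i of A, based on
    # incoming (W1) and outgoing (W2) weights: data-agnostic weight matching.
    sim = W1_a @ W1_b.T + W2_a.T @ W2_b
    row_ind, col_ind = linear_sum_assignment(-sim)   # maximize total similarity
    P = np.eye(W1_a.shape[0])[col_ind]               # P @ W1_b reorders B's hidden units

    W1_b_aligned = P @ W1_b                          # permute incoming weights
    W2_b_aligned = W2_b @ P.T                        # permute outgoing weights

    # Closed-form merge: simple average of aligned weights.
    return 0.5 * (W1_a + W1_b_aligned), 0.5 * (W2_a + W2_b_aligned)
```

In deeper networks the permutation chosen for one layer also reindexes the input dimension of the next layer, so per-layer assignments must be propagated consistently through the stack.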

3. Mathematical Formulations and Theoretical Guarantees

Permutation alignment is cast as a (layerwise) optimization over the symmetric group, $\min_{P \in S_d} \|A - PBP^T\|_F^2$, or, in the presence of both row and column permutation ambiguities,

$$(P, Q) = \arg\min_{P, Q \in S_n} \|A - P B Q^T\|_F^2$$

This is formally a quadratic assignment problem; practical approaches decouple layers and solve per-layer assignments via the Hungarian algorithm or, after relaxing permutations to orthogonal matrices, via Procrustes analysis (Ainsworth et al., 2022, Sharma et al., 16 Oct 2024).
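As a sketch of the orthogonal relaxation, the following solves the one-sided Procrustes problem $\min_{Q} \|A - QB\|_F$ over orthogonal $Q$ in closed form via an SVD. This is a standard construction offered for illustration under that simplified objective, not the exact procedure of the cited papers.

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal relaxation of the assignment problem:
    min_Q ||A - Q B||_F over orthogonal Q, solved in closed form."""
    U, _, Vt = np.linalg.svd(A @ B.T)   # SVD of the cross-covariance
    Q = U @ Vt                          # optimal orthogonal map sending B toward A
    return Q @ B, Q
```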

In the federated ICA setting (RF-ICA), after permutation alignment by $k$-means, aggregation by geometric median provides worst-case recovery error (ignoring sign flips) bounded by quantiles of local estimation error,

$$\|m_a - A^*_{\pi(a)}\| \leq \frac{2p_a}{2p_a - 1 - 16\epsilon/\Delta}\, Q_a(p_a)$$

where $Q_a(p_a)$ is the $p_a$-th quantile of in-cluster errors, $\Delta$ is the inter-component separation, and $\epsilon$ measures local mean-square estimation error. For orthogonal mixing, the total squared Frobenius recovery error is

$$\sum_a \|m_a - A^*_{\pi(a)}\|^2 \lesssim \frac{r^2}{Q_{\mathrm{data}}(1-p^*)}$$

where $Q_{\mathrm{data}}(1-p^*)$ is a quantile of the sample-size distribution (Jin et al., 26 May 2025).
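A minimal sketch of the $k$-means permutation-resolution step, assuming each client submits a sign-normalized $(d \times r)$ mixing-matrix estimate; names and details are illustrative and may differ from the RF-ICA implementation in (Jin et al., 26 May 2025):

```python
import numpy as np
from sklearn.cluster import KMeans

def resolve_permutations(local_estimates, r):
    """Pool column estimates from all clients and cluster them into r components.

    local_estimates: list of (d, r) mixing-matrix estimates A^(k), one per client,
    assumed already sign-normalized. Returns per-client column-to-component labels.
    """
    columns = np.concatenate([A_k.T for A_k in local_estimates], axis=0)  # (K*r, d)
    labels = KMeans(n_clusters=r, n_init=10, random_state=0).fit_predict(columns)
    # labels[k, j] is the component index assigned to column j of client k;
    # each cluster is then aggregated (e.g., by geometric median).
    return labels.reshape(len(local_estimates), r)
```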

For neural networks, matched merging can yield zero-barrier linear mode connectivity when the architectures are sufficiently wide and permutations are correctly resolved, substantiating the single-basin hypothesis modulo permutation symmetry (Ainsworth et al., 2022, Verma et al., 1 Mar 2024).

4. Specialized Methodologies Across Model Families

| Domain | Permutation Resolution | Aggregation | Postprocessing |
|---|---|---|---|
| Federated ICA | $k$-means clustering | Geometric median | None |
| MLPs/CNNs | Hungarian weight matching | Averaging | (Potential) rescaling |
| Transformers | Blockwise Hungarian on features (MHA, FF), head-aligned | Averaging | Optional residual/embedding alignment |
| Multi-task/Non-local | Bi-assignment per layer (row, col) | Averaging | Affine (TACT) correction |

In Transformers, permutations must be applied distinctly to MHA (keys/queries/values), FFNs, residual streams, and embeddings, due to architectural constraints imposed by residual and layer normalization mechanisms. The optimal permutation per block is solved using linear assignments on standardized cross-model activations; output projections and input embeddings are permuted to maintain functional equivalence (Verma et al., 1 Mar 2024).
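The head-level assignment can be sketched as follows, assuming per-head activations on a shared probe batch are available. This simplified illustration only matches whole heads by flattened activation correlation and omits the intra-head, output-projection, and embedding permutations handled in the full pipeline.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_attention_heads(acts_a, acts_b):
    """Match attention heads of model B to heads of model A via activation similarity.

    acts_a, acts_b: arrays of shape (n_heads, n_tokens, head_dim) holding each
    head's outputs on a shared probe batch. Returns head_perm such that head
    head_perm[i] of model B is paired with head i of model A.
    """
    n_heads = acts_a.shape[0]
    sim = np.zeros((n_heads, n_heads))
    for i in range(n_heads):
        for j in range(n_heads):
            # Correlation between the two heads' flattened outputs (coarse proxy
            # for the feature-wise correlations used in the full method).
            r = np.corrcoef(acts_a[i].ravel(), acts_b[j].ravel())[0, 1]
            sim[i, j] = abs(r)
    _, head_perm = linear_sum_assignment(-sim)   # maximize total similarity
    return head_perm
```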

In ICA and federated learning, $k$-means over pooled column estimates resolves permutations under distributional heterogeneity; subsequent geometric median aggregation is robust in the presence of faulty or low-sample clients and provides explicit misclustering and error bounds (Jin et al., 26 May 2025).

In multi-task scenarios, or if experts are fine-tuned far from the base, post-merge per-task output scaling is crucial. Given per-task activation means/variances $\mu_\ell^i, \sigma_\ell^i$ (and the corresponding merged statistics $\hat{\mu}_\ell^i, \hat{\sigma}_\ell^i$), the correction

$$A'_{\mathrm{merge},\ell}(x) = \alpha_\ell^i \circ A_{\mathrm{merge},\ell}(x) + \beta_\ell^i$$

with $\alpha_\ell^i = \sigma_\ell^i/\hat{\sigma}_\ell^i$, $\beta_\ell^i = \mu_\ell^i - \alpha_\ell^i \hat{\mu}_\ell^i$, restores expert statistics, mitigating variance collapse (Sharma et al., 16 Oct 2024).
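A minimal sketch of the affine correction above, assuming per-layer activation statistics have been collected for expert $i$ and for the merged model on task-$i$ data (function and argument names are illustrative):

```python
import numpy as np

def tact_correction(merged_acts, expert_mean, expert_std, eps=1e-6):
    """Per-task, per-layer affine correction in the spirit of TACT.

    merged_acts: activations of the merged model at layer l on task-i data.
    expert_mean, expert_std: the corresponding statistics of expert i at layer l.
    """
    merged_mean = merged_acts.mean(axis=0)
    merged_std = merged_acts.std(axis=0)
    alpha = expert_std / (merged_std + eps)      # rescale to the expert's variance
    beta = expert_mean - alpha * merged_mean     # shift to the expert's mean
    return alpha * merged_acts + beta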

5. Computational Complexity and Implementation Considerations

  • Permutation Alignment: For a layer of width $d$, each cost-matrix computation and assignment via the Hungarian algorithm is $O(d^3)$. Over $L$ layers this yields $O(\sum_{\ell=1}^L d_\ell^3)$ (Ainsworth et al., 2022).
  • k-means Clustering: For $K$ clients and $r$ components (ICA), Lloyd's algorithm over $Kr$ vectors incurs $O(TKr^2)$ cost over $T$ iterations (Jin et al., 26 May 2025).
  • Geometric Median: In clusters of size $K$, each Weiszfeld iteration is $O(Kr)$; convergence is rapid due to convexity (a minimal sketch follows this list) (Jin et al., 26 May 2025).
  • Communication Cost (federated): One-shot submission of $O(r^2)$ floats per client, totaling $O(Kr^2)$ (Jin et al., 26 May 2025).
  • Affine Correction: Per-task, per-layer statistics can be gathered by a single forward pass per expert and per merged model (Sharma et al., 16 Oct 2024).
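The Weiszfeld iteration referenced above can be sketched as follows (illustrative implementation, not code from the cited paper):

```python
import numpy as np

def geometric_median(points, n_iter=100, tol=1e-7):
    """Weiszfeld iterations for the geometric median of a set of vectors.

    points: (K, r) array of aligned column estimates within one cluster.
    """
    m = points.mean(axis=0)                       # initialize at the mean
    for _ in range(n_iter):
        dists = np.linalg.norm(points - m, axis=1)
        dists = np.maximum(dists, 1e-12)          # avoid division by zero
        weights = 1.0 / dists
        m_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m
```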

In practice, alignment and merge steps are negligible compared to training, and even models with hundreds of millions of parameters (e.g., ResNet50, BERT-base) can be aligned in under a minute on modern hardware (Ainsworth et al., 2022, Verma et al., 1 Mar 2024).

6. Empirical Findings and Domain-Specific Performance

  • Federated ICA: RF-ICA (clustering + geometric median) consistently exceeds simpler column-mean and unclustered geometric-median baselines. RF-ICA tolerates nearly 50% corrupted clients and achieves tight error bounds even under extreme data heterogeneity (Jin et al., 26 May 2025).
  • MLPs/CNNs: Zero-barrier linear mode connectivity (after permutation alignment) holds for wide networks. Narrow networks or early checkpoints can retain nontrivial barriers due to insufficient representational redundancy (Ainsworth et al., 2022).
  • Transformers: When aligning both feed-forward and attention blocks, the loss barrier under model interpolation is reduced by $7\times$ (to nearly zero) for MLM objectives. Feature correlations between independently trained seeds increase significantly after permutation alignment, supporting convergent feature learning (Verma et al., 1 Mar 2024).
  • Non-local Multi-task Merging: Without task-specific postprocessing, merged models see severe performance degradation (variance collapse), with average normalized accuracy as low as ~45%. Simple TACT affine correction restores average and worst-case task accuracy to ~86% and ~78% respectively, closing the performance gap by 30–60 percentage points (Sharma et al., 16 Oct 2024).

7. Limitations, Extensions, and Open Questions

One-shot permutation-based merging is subject to the following practical and theoretical caveats:

  • Width and Training Progress: Unimodal (zero-barrier) connectivity emerges chiefly in sufficiently wide, well-trained models (Ainsworth et al., 2022).
  • Architectural Constraints: Depthwise convolutions, groupnorms, or heavily entangled architectures may sharply restrict permissible permutations (Ainsworth et al., 2022).
  • Fine-tuning and Non-locality: Experts drifting far from a shared initialization exhibit increased divergence, necessitating output or intermediate normalization to avoid variance collapse (Sharma et al., 16 Oct 2024).
  • Residual Paths and LayerNorm: In Transformers, tied symmetry constraints imposed by skip connections and normalization limit the degrees of freedom available for permutation, reducing potential for barrier elimination (Verma et al., 1 Mar 2024).
  • Data-Dependence: Activation-matching can increase alignment fidelity over pure weight-matching, but requires auxiliary data and incurs additional compute (Ainsworth et al., 2022, Verma et al., 1 Mar 2024).

A plausible implication is that further research into joint/global (rather than strictly local/layerwise) permutation optimization, relaxation to soft alignments (e.g., optimal transport), and task-conditional normalization techniques could extend applicability to even more heterogeneous or architecturally intricate ensembles.

