Transformer Minima Similarities
- Transformer Minima Similarities are characterized by aligning permutation-symmetric transformer weights to reveal shared low-loss regions across independently trained models.
- The method relies on one-shot techniques: local model estimation, optimal assignment via algorithms such as the Hungarian algorithm, and a single direct-averaging or robust-aggregation step.
- Empirical studies show that such merging reduces variance collapse and improves robustness in settings like federated learning and multi-task model aggregation.
One-shot permutation-based merging refers to a class of model aggregation techniques that align and merge the weights or estimators of independently trained neural networks or statistical models in a single pass, accounting for permutation symmetries among latent units or features. These approaches address the fact that, due to inherent symmetries, independently trained models can represent semantically equivalent solutions that are permutations of one another, preventing direct averaging or aggregation. One-shot permutation-based merging constructs explicit permutation alignments—typically via combinatorial optimization or clustering—followed by a single aggregation step, with no subsequent fine-tuning or iterative matching. This framework has broad relevance in distributed learning, federated ICA, multi-task expert merging, and inter-seed neural network alignment.
1. Foundations and Formalization
The central insight underlying permutation-based merging is that various parameter symmetries, most notably hidden unit or component permutations, do not alter the function realized by many neural architectures or statistical models (e.g., MLPs, CNNs, ICA). As a result, aggregation or interpolation in parameter space is ill-posed unless these equivalence classes are respected. Given $M$ models $\theta_1, \dots, \theta_M$ (one per agent or task), and for each model $m$ a set of admissible permutation matrices acting on each layer or component, the goal is to find optimal permutations $\pi_1, \dots, \pi_M$ such that all models are brought into a nearly aligned canonical form. The canonical merging formula is
$$\bar{\theta} \;=\; \frac{1}{M} \sum_{m=1}^{M} \pi_m(\theta_m),$$
where $\pi_m(\cdot)$ denotes the permutation action on the parameters of model $m$.
Depending on the domain, aggregation may be implemented via averaging, geometric median, or more sophisticated robust estimators; the alignment step generally involves solving an assignment (matching) or clustering problem to address permutation ambiguity (Ainsworth et al., 2022, Verma et al., 1 Mar 2024, Jin et al., 26 May 2025).
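To make the underlying symmetry concrete, here is a minimal sketch (a generic two-layer ReLU MLP in NumPy, not drawn from the cited papers) verifying that permuting the hidden units, together with the matching rows and columns of the adjacent weight matrices, leaves the realized function unchanged:

```python
# Hidden-unit permutation symmetry of a 2-layer ReLU MLP (illustrative example).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2  # ReLU hidden layer

perm = rng.permutation(d_hidden)       # a random relabeling of the hidden units
W1_p, b1_p = W1[perm], b1[perm]        # permute the rows feeding the hidden units
W2_p = W2[:, perm]                     # permute the matching columns of the next layer

x = rng.normal(size=d_in)
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_p, b1_p, W2_p, b2))
```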
2. One-Shot Alignment Algorithms
The core one-shot algorithmic pipeline consists of the following steps:
- Local model estimation: Each agent or expert independently trains its model, e.g., by fitting an ICA unmixing matrix (Jin et al., 26 May 2025), or optimizing a deep network on separate initializations or tasks (Ainsworth et al., 2022, Sharma et al., 16 Oct 2024).
- Feature or weight alignment: For each target component (layer, factor, head), solve for permutation matrices minimizing a distance or maximizing similarity, typically via the linear assignment/Hungarian algorithm. Cost criteria include squared Frobenius norm, cross-model feature correlations, or proxy losses specific to the architecture.
- Merging step: After alignment, merge the parameters by averaging or, for robustness, via geometric median or multi-atlas fusion, depending on noise and data heterogeneity (a minimal sketch of the alignment-and-merge step follows below).
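The following is a minimal sketch of this pipeline for a single hidden layer shared by two two-layer MLPs, assuming SciPy for the assignment solver; the weight-similarity cost is one simple choice among those listed above, not the exact criterion of any one cited paper.

```python
# One-shot weight alignment (Hungarian assignment) followed by direct averaging.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_merge(W1_a, b1_a, W2_a, W1_b, b1_b, W2_b):
    """W1_*: (d_hidden, d_in), b1_*: (d_hidden,), W2_*: (d_out, d_hidden)."""
    # Cost of matching hidden unit i of model A to unit j of model B:
    # negative similarity of their incoming and outgoing weights.
    cost = -(W1_a @ W1_b.T + W2_a.T @ W2_b)            # (d_hidden, d_hidden)
    _, perm = linear_sum_assignment(cost)               # Hungarian algorithm
    # Apply the permutation to model B so unit perm[i] lines up with A's unit i.
    W1_b_al, b1_b_al, W2_b_al = W1_b[perm], b1_b[perm], W2_b[:, perm]
    # Single merging step: plain averaging, with no subsequent fine-tuning.
    return (0.5 * (W1_a + W1_b_al),
            0.5 * (b1_a + b1_b_al),
            0.5 * (W2_a + W2_b_al))
```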
Notable instantiations include:
- For feedforward networks, alignment is often layerwise via cost matrices derived from weight or activation similarities (Ainsworth et al., 2022, Sharma et al., 16 Oct 2024).
- For Transformers, alignment of multi-head attention is performed blockwise, per head, to respect head semantics, and embeddings/layer norms are permuted accordingly (Verma et al., 1 Mar 2024); a head-permutation sketch appears after this list.
- For federated ICA, cluster assignment via $k$-means removes the column (component) permutation before robust median aggregation (Jin et al., 26 May 2025).
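The sketch below illustrates the blockwise head treatment under assumed weight layouts (rows of the Q/K/V projections and columns of the output projection grouped by head); it is an illustration of the idea rather than the exact procedure of (Verma et al., 1 Mar 2024).

```python
# Blockwise (per-head) permutation of multi-head attention weights.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_heads(feats_a, feats_b):
    """feats_*: (n_heads, n_features) per-head summary statistics (e.g. mean
    activations on held-out data). Returns the permutation of B's heads that
    maximizes total correlation with A's heads."""
    a = feats_a - feats_a.mean(axis=1, keepdims=True)
    b = feats_b - feats_b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    _, perm = linear_sum_assignment(-(a @ b.T))
    return perm

def permute_heads(Wq, Wk, Wv, Wo, head_perm, n_heads):
    """Reorder whole heads: Q/K/V rows and output-projection columns move as blocks."""
    d_head = Wq.shape[0] // n_heads
    def rows(W):   # reorder row blocks of size d_head
        return W.reshape(n_heads, d_head, -1)[head_perm].reshape(W.shape)
    def cols(W):   # reorder column blocks of size d_head
        blocks = W.reshape(-1, n_heads, d_head).transpose(1, 0, 2)
        return blocks[head_perm].transpose(1, 0, 2).reshape(W.shape)
    return rows(Wq), rows(Wk), rows(Wv), cols(Wo)
```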
3. Theoretical Guarantees and Error Bounds
Theoretical analyses provide recovery guarantees and error bounds for the merged model in terms of the alignment and aggregation errors. For federated ICA under heterogeneity, with the number of components, number of clients, SNR, and per-client sample sizes as the relevant parameters, the combined $k$-means and geometric median approach yields the following:
- $k$-means center alignment: provided the local estimators achieve sufficiently small MSE, there exists a permutation under which each cluster center is close to the corresponding true component, with error controlled by the local estimation error [(Jin et al., 26 May 2025), Lemma 3.1].
- Misclustering control: the per-cluster misclustering rate is bounded in terms of the local estimation error relative to the minimal separation between true components.
- Geometric median aggregation: the within-cluster aggregation error is controlled by a quantile of the per-client error magnitudes, yielding robustness to outliers (Jin et al., 26 May 2025); the geometric median is defined below.
- End-to-end RF-ICA guarantee: under suitable SNR, sample-size, and corruption-level conditions, the merged estimator recovers each source component up to a signed permutation [(Jin et al., 26 May 2025), Theorem 3.4].
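For reference, the geometric median used in the aggregation step is the standard multivariate median of the per-client (or per-cluster) estimates; this is the textbook definition rather than a formula specific to the cited work:
$$\operatorname{GM}(x_1, \dots, x_K) \;=\; \arg\min_{z \in \mathbb{R}^{d}} \sum_{k=1}^{K} \lVert z - x_k \rVert_2 .$$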
A similar structure applies to deep neural network merging: alignment via permutation matching uncovers low-loss or even convex basins between solutions, yielding zero-barrier or low-barrier interpolation (Ainsworth et al., 2022). Empirically, sufficiently wide and well-trained fully connected networks and ResNets satisfy these merging properties.
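A minimal sketch of how such an interpolation barrier can be measured is given below; `loss_fn` and the parameter dictionaries are assumed user-supplied inputs, and the two models are assumed to be already permutation-aligned.

```python
# Loss barrier along the straight line between two aligned parameter sets.
import numpy as np

def loss_barrier(params_a, params_b, loss_fn, n_points=25):
    """params_a, params_b: dicts name -> np.ndarray (permutation-aligned).
    loss_fn: callable mapping such a dict to a scalar loss."""
    ts = np.linspace(0.0, 1.0, n_points)
    losses = np.array([
        loss_fn({k: (1 - t) * params_a[k] + t * params_b[k] for k in params_a})
        for t in ts
    ])
    baseline = (1 - ts) * losses[0] + ts * losses[-1]  # linear interpolation of endpoint losses
    return float(np.max(losses - baseline))             # barrier height (~0 means barrier-free)
```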
4. Implementation Details Across Domains
The table below summarizes per-domain design choices in one-shot permutation-based merging.
| Domain | Alignment Approach | Merge Function |
|---|---|---|
| Feedforward/CNN (Ainsworth et al., 2022, Sharma et al., 16 Oct 2024) | Hungarian algorithm per layer | Averaging |
| Transformer (Verma et al., 1 Mar 2024) | Cross-model correlation + per-head block Hungarian matching | Averaging |
| ICA/federated (Jin et al., 26 May 2025) | $k$-means clustering of components | Geometric median |
The steps typically are:
- Compute cross-model similarity or distance metrics.
- Solve linear assignment (or, for $k$-means, cluster assignment) to obtain permutations.
- Apply permutations to align all non-output layers as allowed by architecture symmetries.
- Merge aligned weights or estimators without further fine-tuning.
For multi-task or non-local merging, an additional affine output correction per task compensates for "variance collapse" (the reduction in activation variance after naive merging) by rescaling and shifting merged activations to match each expert's statistics. This step is implemented by per-layer, per-task affine parameters (a scale and a shift), constructed from the expert's and the merged model's activation means and standard deviations (Sharma et al., 16 Oct 2024).
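A minimal sketch of computing such a correction from one layer's activations on one task's data is shown below; names and shapes are illustrative, not the exact TACT implementation of (Sharma et al., 16 Oct 2024).

```python
# Per-layer, per-task affine correction against variance collapse.
import numpy as np

def affine_correction(expert_acts, merged_acts, eps=1e-6):
    """expert_acts, merged_acts: (n_samples, n_units) activations of one layer
    on one task's data. Returns per-unit scale and shift so that the corrected
    merged activations match the expert's mean and standard deviation."""
    scale = (expert_acts.std(axis=0) + eps) / (merged_acts.std(axis=0) + eps)
    shift = expert_acts.mean(axis=0) - scale * merged_acts.mean(axis=0)
    return scale, shift

def apply_correction(merged_acts, scale, shift):
    return merged_acts * scale + shift   # rescale and shift the merged activations
```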
5. Empirical Performance and Practical Implications
Empirical studies support the following observations:
- Zero- or low-barrier connectivity: On sufficiently wide MLPs, ResNets, and Transformers, permutation-based alignment followed by averaging yields either strictly convex or nearly barrier-free interpolation paths in loss space (Ainsworth et al., 2022, Verma et al., 1 Mar 2024).
- Robustness in federated settings: For federated ICA, the RF-ICA $k$-means + geometric median pipeline achieves strong robustness to heterogeneity, tolerating up to ~50% corrupted clients and consistently outperforming naive mean/median methods in the presence of permutation ambiguity (Jin et al., 26 May 2025).
- Effect of task diversity: In non-local merging (experts from different pretraining or foundation models), naive permutation-averaging yields "variance collapse," drastically lowering activation variance and harming accuracy. Per-task affine correction (TACT) restores performance, with observed average normalized accuracy increasing from ~45% to ~86% and worst-case from ~19% to ~78% in experiments on VGG16 with 4 tasks (Sharma et al., 16 Oct 2024).
- Fine-tuning effects: In fine-tuned Transformers, loss barriers may remain non-convex post-merge, and the method's efficacy may be diminished due to reduced symmetry degrees of freedom or domain shift (Verma et al., 1 Mar 2024).
6. Limitations and Failure Modes
Known limitations include:
- Architectural constraints: Models with few permutation symmetries (e.g., depth-wise/constrained architectures) may not admit useful alignment (Ainsworth et al., 2022).
- Width and training phase: Narrow networks or those matched early in training fail to realize perfect basin alignment; permutations stabilize progressively during optimization (Ainsworth et al., 2022).
- Feature misalignment: Data-agnostic matching may misalign functionally distinct units. Data-driven, activation-based assignment is more accurate but computationally demanding (Ainsworth et al., 2022, Sharma et al., 16 Oct 2024).
- Domain shift: When held-out data for feature collection in Transformer merging are not representative, alignment becomes unreliable (Verma et al., 1 Mar 2024).
An additional failure mode appears in non-local merging when models' activation statistics differ across tasks: direct merging is susceptible to variance collapse, which is not mitigated by permutation matching alone but requires per-task affine correction (Sharma et al., 16 Oct 2024).
7. Representative Algorithms and Simulation Insights
For federated ICA, the RF-ICA one-shot protocol is as follows (Jin et al., 26 May 2025):
- Each client performs local ICA to estimate the columns of its unmixing matrix, determined only up to a signed permutation.
- Signs are resolved relative to a reference client.
- $k$-means clustering aligns columns from all clients to resolve the global permutation ambiguity.
- Within-cluster geometric median computation yields robust global estimators for each source (a minimal sketch of this merge step follows below).
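A minimal sketch of the clustering-plus-geometric-median merge, assuming scikit-learn for $k$-means and a simple Weiszfeld iteration for the geometric median; it illustrates the protocol rather than reproducing the exact RF-ICA implementation of (Jin et al., 26 May 2025).

```python
# Clustering + geometric-median merge of locally estimated ICA unmixing columns.
import numpy as np
from sklearn.cluster import KMeans

def geometric_median(X, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm for the geometric median of the rows of X."""
    z = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(X - z, axis=1) + eps
        z = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return z

def merge_ica_columns(client_columns, n_components, seed=0):
    """client_columns: list of (n_components, d) arrays, one per client, whose
    rows are that client's estimated unmixing columns (signs already resolved
    against a reference client)."""
    stacked = np.vstack(client_columns)                       # (n_clients * k, d)
    labels = KMeans(n_clusters=n_components, n_init=10,
                    random_state=seed).fit_predict(stacked)   # resolve global permutation
    return np.stack([geometric_median(stacked[labels == c])   # robust per-source merge
                     for c in range(n_components)])
```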
In multi-expert neural network settings, the canonical algorithm (as in (Sharma et al., 16 Oct 2024)) involves alternating optimization of per-layer permutations (Hungarian or Procrustes), aggregation, and per-task affine correction for activation statistics.
Simulation protocols consistently validate these algorithms:
- RF-ICA: Outperforms non-clustering baselines except when all clients are "good," and remains robust up to nearly 50% corruption (Jin et al., 26 May 2025).
- Transformer merging: Merged paths exhibit reduced or negligible loss barriers; merging both FFN and MHA is critical for best performance (Verma et al., 1 Mar 2024).
- Variance-corrected merging: TACT-corrected merges recover up to 60 percentage points of the accuracy gap between naive permutation-averaging and the expert models in multi-task, non-local settings (Sharma et al., 16 Oct 2024).
In conclusion, one-shot permutation-based merging provides a theoretically principled and empirically robust framework for aggregating symmetrically-parameterized models, applicable across federated, multi-expert, and deep learning regimes. Its success depends not only on precise permutation alignment but also, in some settings, on subsequent correction for statistical pathology such as variance collapse. These methods uncover fundamental structural properties of neural and statistical models’ parameter spaces, including shared basins and the central role of symmetries (Ainsworth et al., 2022, Verma et al., 1 Mar 2024, Sharma et al., 16 Oct 2024, Jin et al., 26 May 2025).