
Mean Squared Overlap (MSO)

Updated 8 January 2026
  • Mean Squared Overlap (MSO) is a quantitative metric that measures the similarity between components of high-dimensional models using geometric and statistical methods.
  • It admits rigorous definitions in both weight space and activation space, quantifying correlations and functional overlap in MoE networks and spin glass systems.
  • MSO provides actionable insights into model specialization and phase behavior, informing the design of neural architectures and the analysis of disordered systems.

The mean squared overlap (MSO) is a quantitative metric used to characterize the statistical or geometric similarity between components in high-dimensional models, serving as a key indicator of correlation structure, specialization, or phase behavior depending on context. MSO has been employed both in the analysis of mixture-of-experts (MoE) neural architectures—measuring diversity and functional overlap between experts—and in the mean field theory of spin glass models—acting as an order parameter reflecting phase organization and self-averaging. Its formalism, empirical behavior, and interpretive value are deeply tied to the class of system and methodological context in which it is deployed.

1. Formal Definitions of Mean Squared Overlap

Weight-Space MSO in MoE

For an MoE layer containing $N$ experts, each with weight matrix $W_i$, the weight-space MSO is computed by first flattening and $\ell_2$-normalizing each weight matrix:

$$\tilde W_i = \frac{\mathrm{vec}(W_i)}{\|\mathrm{vec}(W_i)\|}.$$

The mean squared overlap across all expert pairs is

$$\mathrm{MSO}_\text{weight} = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \left|\langle \tilde W_i, \tilde W_j\rangle\right|^2.$$

This metric measures geometric orthogonality in parameter space: lower values indicate more orthogonal (diverse) expert weights (Kim, 1 Jan 2026).
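
A minimal sketch of this computation in NumPy follows; the function name and the random example weights are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the weight-space MSO computation, assuming each expert's
# weights are available as a NumPy array.
import numpy as np

def weight_space_mso(expert_weights):
    """Mean squared overlap over all expert pairs, per the definition above."""
    # Flatten and l2-normalize each expert's weight matrix: W~_i.
    flat = [w.ravel() / np.linalg.norm(w.ravel()) for w in expert_weights]
    n = len(flat)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(flat[i], flat[j]) ** 2  # |<W~_i, W~_j>|^2
    return 2.0 * total / (n * (n - 1))

# Example: 8 random experts, matching the paper's 8-expert MoE layers.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((64, 256)) for _ in range(8)]
print(weight_space_mso(experts))  # near 0 for random high-dimensional weights
```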

Activation-Space MSO in MoE

During inference, each input $x$ selects a subset $\mathcal{S}(x)$ of $k$ experts (typically $k=2$, under top-2 routing). For the normalized activations $h_i(x)$ from the selected experts, the functional overlap is quantified as

$$\mathrm{MSO}_\text{act} = \mathbb{E}_{x\sim\mathcal{D}} \left[ \frac{2}{k(k-1)} \sum_{i<j \in \mathcal{S}(x)} \left(\frac{\langle h_i(x), h_j(x)\rangle}{\|h_i(x)\|\,\|h_j(x)\|}\right)^2 \right].$$

This measures output alignment or specialization in activation space, evaluated on actual data (Kim, 1 Jan 2026).
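
A corresponding sketch for the activation-space quantity, assuming the top-$k$ routed activations have been gathered into a `(batch, k, d)` array (this layout is an assumption chosen for illustration):

```python
# Sketch of the activation-space MSO for one batch of routed activations.
import numpy as np

def activation_space_mso(h):
    """h: array of shape (batch, k, d); h[x, i] is the activation of the
    i-th selected expert for input x (k experts per token, e.g. top-2)."""
    b, k, _ = h.shape
    # Cosine similarities between all selected-expert pairs, per input.
    h_unit = h / np.linalg.norm(h, axis=-1, keepdims=True)
    total = np.zeros(b)
    for i in range(k):
        for j in range(i + 1, k):
            cos = np.sum(h_unit[:, i] * h_unit[:, j], axis=-1)
            total += cos ** 2
    # Average over pairs, then take the empirical expectation over inputs.
    return float(np.mean(2.0 * total / (k * (k - 1))))

h = np.random.default_rng(1).standard_normal((1000, 2, 128))  # toy top-2 data
print(activation_space_mso(h))  # ~1/d for random vectors; higher on real data
```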

MSO in Spin Glass Theory

In the Ghatak–Sherrington mean-field model for spin glasses, the overlap between two replicas $\sigma^1, \sigma^2$ of size $N$ is

$$R_{1,2} = \frac{1}{N} \sum_{i=1}^N \sigma_i^1 \sigma_i^2.$$

The mean squared overlap is then

$$\mathrm{MSO} = \lim_{N\to\infty} \mathbb{E}[R_{1,2}^2] = (\mathbb{E}\, R_{1,2})^2,$$

where the expectation is taken with respect to the quenched Gibbs measure and the limiting value is given by the square of the replica-symmetric overlap parameter (Sheng et al., 2023).
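
The definition can be illustrated with a toy Monte Carlo estimate. The sketch below draws replicas from an independent product measure with magnetization $m$, not from the actual Ghatak–Sherrington Gibbs measure (which would require MCMC over the disordered Hamiltonian); it serves only to show how $R_{1,2}$ and its mean square are estimated, and how a $1/N$ finite-size term enters.

```python
# Toy Monte Carlo sketch of the replica overlap R_{1,2} and its mean square.
# Replicas are drawn from an independent +-1 product measure with
# magnetization m -- an illustration, NOT the Ghatak-Sherrington measure.
import numpy as np

rng = np.random.default_rng(2)
N, m, n_samples = 1000, 0.3, 5000
p_up = (1 + m) / 2  # P(sigma_i = +1) under the product measure

r_sq = []
for _ in range(n_samples):
    s1 = rng.choice([-1.0, 1.0], size=N, p=[1 - p_up, p_up])  # replica 1
    s2 = rng.choice([-1.0, 1.0], size=N, p=[1 - p_up, p_up])  # replica 2
    r_sq.append(np.mean(s1 * s2) ** 2)  # R_{1,2}^2 for this draw

# Under this toy measure E[R_{1,2}] = m^2, so the estimate exceeds m**4 by
# roughly (1 - m**4)/N -- the same q^2 + O(1/N) structure described above.
print(np.mean(r_sq), m**4)
```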

2. Methodologies for MSO Measurement

MoE Empirical Protocol

MSO was empirically measured in 130M-parameter NanoGPT-MoE models with 6 MoE layers, each containing 8 experts, using top-2 routing. For each regularization parameter $\lambda \in \{0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2\}$, weight-space MSO was computed per layer (then averaged across layers), while activation-space MSO was computed over 1K validation samples, considering only the top-2 expert activations per input, unweighted by gating scores. Datasets included TinyStories, WikiText-103, and PTB, each run across multiple random seeds. Orthogonality regularization was implemented as

$$L_\text{orth} = \sum_{i<j} \left|\langle \tilde W_i, \tilde W_j\rangle\right|^2$$

and added to the standard language modeling objective (Kim, 1 Jan 2026).
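
A schematic PyTorch implementation of this penalty, under the assumption that each MoE layer exposes its expert weight tensors; names here are illustrative rather than the paper's actual code:

```python
# Sketch of the orthogonality penalty added to the LM objective, in PyTorch.
import torch

def orthogonality_loss(expert_weights):
    """L_orth = sum_{i<j} |<W~_i, W~_j>|^2 over one MoE layer's experts."""
    flat = torch.stack([w.flatten() for w in expert_weights])  # (N, d)
    flat = torch.nn.functional.normalize(flat, dim=-1)         # l2-normalize
    gram = flat @ flat.T                                       # <W~_i, W~_j>
    off_diag = gram - torch.eye(len(expert_weights), device=gram.device)
    return 0.5 * (off_diag ** 2).sum()  # 0.5 counts each pair i<j once

# Training step (schematic): total loss = LM loss + lambda * L_orth, with
# lambda swept over {0, 0.001, ..., 0.2} as in the protocol above:
# loss = lm_loss + lam * sum(orthogonality_loss(layer.expert_weights)
#                            for layer in moe_layers)
```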

Spin Glass Analytical Approach

In the mean-field Ghatak–Sherrington model, the MSO was analyzed using the moment method and cavity approach to establish joint central limit theorems for overlap and self-overlap arrays in the high-temperature regime. Explicit expressions for finite-$N$ corrections to $\mathbb{E}[R_{1,2}^2]$ are given in terms of model-specific constants $A_2^2$, $A_1^2$, and $A_0^2$. The regime of validity includes all $\beta < \beta'$, for fixed external and crystal fields (Sheng et al., 2023).
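
One practical use of such an expansion is extrapolation: measured values of $\mathbb{E}[R_{1,2}^2]$ at several system sizes can be fit against $1/N$ to recover $q^2$ and the correction coefficient. The sketch below uses made-up data points purely to illustrate the fit; the paper's closed-form constants $A_j^2$ are not reproduced here.

```python
# Sketch: fit measured mean squared overlaps to E[R^2] ~ q^2 + C/N
# to extract q^2 and the 1/N coefficient. Data points are hypothetical.
import numpy as np

Ns = np.array([100, 200, 400, 800, 1600])
mso = np.array([0.121, 0.110, 0.105, 0.102, 0.101])  # made-up E[R^2] values

# Least squares in (1/N): mso ~ q2 + C * (1/N).
A = np.vstack([np.ones_like(Ns, dtype=float), 1.0 / Ns]).T
(q2, C), *_ = np.linalg.lstsq(A, mso, rcond=None)
print(f"q^2 ~ {q2:.4f}, 1/N coefficient C ~ {C:.3f}")
```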

3. Key Empirical and Theoretical Results

MoE: Effect of Orthogonality Loss

Introduction of an explicit orthogonality loss in MoE layers produced a counterintuitive increase in weight-space MSO:

  • Baseline ($\lambda=0$): $\mathrm{MSO}_\text{weight} = 5.43\times10^{-4}$
  • $\lambda=0.001$: $7.52\times10^{-4}$ (+39%)
  • $\lambda=0.01$: $1.16\times10^{-3}$ (+114%)
  • The trend persists for larger $\lambda$ (up to $2.78\times10^{-3}$ at $\lambda=0.2$)

Activation-space MSO remained stable across the sweep ($\sim$0.57–0.59), showing no substantial effect of regularization. The Pearson correlation between $\mathrm{MSO}_\text{weight}$ and $\mathrm{MSO}_\text{act}$ was statistically insignificant ($r=-0.293$, $p=0.523$) (Kim, 1 Jan 2026).

Summary Table: Weight vs. Activation MSO (TinyStories sweep)

| $\lambda$ | Weight MSO | Activation MSO | Ratio (act/weight) |
|---|---|---|---|
| 0 | $5.43\times10^{-4}$ | 0.572 | 1053× |
| 0.001 | $7.52\times10^{-4}$ | 0.581 | 773× |
| 0.01 | $1.16\times10^{-3}$ | 0.577 | 496× |
| 0.1 | $2.04\times10^{-3}$ | 0.593 | 290× |
| 0.2 | $2.78\times10^{-3}$ | 0.564 | 203× |
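
For illustration, the correlation test can be run with `scipy.stats.pearsonr`. Note that the five table rows above alone yield a different $r$ than the reported $r=-0.293$, which is computed over the full seven-point $\lambda$ sweep; this sketch shows only the procedure.

```python
# Sketch of the correlation test between weight and activation MSO.
# Inputs are the five table rows above; the paper's reported value
# (r = -0.293, p = 0.523) uses the full seven-point lambda sweep.
import numpy as np
from scipy.stats import pearsonr

weight_mso = np.array([5.43e-4, 7.52e-4, 1.16e-3, 2.04e-3, 2.78e-3])
act_mso = np.array([0.572, 0.581, 0.577, 0.593, 0.564])

r, p = pearsonr(weight_mso, act_mso)
print(f"r = {r:.3f}, p = {p:.3f}")  # weak, statistically insignificant
```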

Spin Glass: High-Temperature Limit

In the Ghatak–Sherrington model, the high-temperature (replica-symmetric) mean squared overlap is $\mathrm{MSO} = q^2$, with $q = \mathbb{E}[R_{1,2}]$ determined by fixed-point equations. The finite-size correction is

$$\mathbb{E}[R_{1,2}^2] = q^2 + \frac{A_2^2 + 2A_1^2 + A_0^2}{N} + o(N^{-1}),$$

where the $A_j^2$ are closed-form functions of the model parameters (Sheng et al., 2023).

4. Interpretations and Theoretical Significance

MoE: Disconnection between Weight and Activation Diversity

Weight-space orthogonality constraints, as implemented via a Frobenius-trace penalty, enforce only the trace condition $\mathrm{tr}(W_i^\top W_j) = 0$, not orthogonality of the full matrix product, creating a mismatch between the geometric regularization objective and the actual weight configuration. This perturbation can cause weights to drift into higher-magnitude directions, increasing pairwise inner products (and thus MSO) paradoxically under explicit regularization. Nonlinearities (e.g., SiLU) and LayerNorm further disrupt any correspondence between weight and activation overlaps; even if $W_i \perp W_j$, the post-activation outputs $h_i, h_j$ can remain highly aligned on natural data, as the toy example below illustrates. The input distribution in natural language tasks, living on a low-rank manifold, can also force different experts to exhibit similar activations (Kim, 1 Jan 2026).
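
The weight/activation mismatch can be demonstrated in a few lines: the sketch below constructs two expert weight matrices that are exactly orthogonal in the Frobenius sense, feeds them inputs confined to a low-rank subspace, and measures the alignment of their SiLU outputs. Dimensions and the input model are assumptions chosen for illustration.

```python
# Toy demonstration: exactly orthogonal expert weights can still produce
# highly aligned post-SiLU activations on low-rank inputs.
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d, hidden = 256, 64
W1 = rng.standard_normal((hidden, d))
W2 = rng.standard_normal((hidden, d))
# Project out W1 so that <vec(W1), vec(W2)> = 0 exactly (Frobenius sense).
W2 -= (np.sum(W1 * W2) / np.sum(W1 * W1)) * W1

# Inputs confined to a low-rank (here 2-dimensional) subspace of R^d.
basis = rng.standard_normal((2, d))
x = rng.standard_normal((1000, 2)) @ basis

h1, h2 = silu(x @ W1.T), silu(x @ W2.T)
cos = np.sum(h1 * h2, axis=1) / (
    np.linalg.norm(h1, axis=1) * np.linalg.norm(h2, axis=1))
print(np.sum(W1 * W2))    # ~0: experts are orthogonal in weight space
print(np.mean(cos ** 2))  # well above the ~1/hidden random baseline
```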

Spin Glasses: MSO as an Order Parameter

In spin glass systems, MSO encapsulates the self-averaging and long-range order of the system, stabilizing to the square of the replica-symmetric overlap in the thermodynamic limit. The value of $q = \mathbb{E}[R_{1,2}]$ tracks the system's phase and responds predictably to external and crystal fields. A higher crystal-field parameter $D$ tends to reduce MSO, while a stronger external field $h$ increases it. The finite-size corrections and covariance structure from the central limit theorem provide a rigorous quantitative framework for fluctuations and critical behavior near the high-temperature limit (Sheng et al., 2023).

5. Implications for Model Specialization and Performance

MoE: Limitations and Design Recommendations

Empirically, weight-space MSO is not a reliable proxy for expert specialization: orthogonality regularization neither achieves its geometric goal nor reliably improves downstream performance. For instance, on TinyStories, orthogonality loss marginally worsens perplexity (+0.9%, $p=0.727$), while WikiText-103 shows a marginal improvement (–0.9%, $p<0.05$), and PTB exhibits high variance with no consistent effect. The lack of correlation between weight and activation MSO suggests that MoE diversity must be promoted at the functional (activation) level or via routing and diversity objectives acting on actual expert outputs (Kim, 1 Jan 2026).

Spin Glass Theory: Control via Field Parameters

In the statistical mechanics of disordered systems, MSO provides a direct readout of the overlap structure and is modulated via the field parameters $D$ and $h$. This supports precise analytical investigation into the relationship between microscopic correlations and macroscopic phases in high-dimensional, interacting systems (Sheng et al., 2023).

6. Directions for Future Investigation

A plausible implication is that, in neural MoE systems, activation-space diversity metrics (such as $\mathrm{MSO}_\text{act}$ or analogous losses) and routing-aware objectives may provide more direct and robust mechanisms for specialization and performance gains than weight-space geometric regularization. In spin glass models, further refinement of MSO analysis beyond the high-temperature regime, especially in the context of phase transitions or competing disorder, remains a substantive avenue for theoretical exploration (Kim, 1 Jan 2026; Sheng et al., 2023).
