Weighted MSE Fusion Methods
- Weighted MSE fusion is a technique that combines estimates or distributions with convex weights to minimize overall mean square error.
- It includes v-fusion and f-fusion paradigms using arithmetic and geometric averages that balance bias, variance, and robustness.
- Applications range from sensor networks and multi-target tracking to neural network layer fusion, providing tailored trade-offs between precision and stability.
Weighted mean square error (MSE) fusion methods form a class of information-aggregation techniques where estimates, probability distributions, or neural network layers are combined using schemes that explicitly optimize or analyze MSE under weighting constraints. The central logic is to produce a fused entity—be it a scalar, vector, probability density, or neural network parameter block—whose mean squared deviation from an unknown or desired target is minimized according to selected weights, which may reflect information quality, confidence, or architectural priorities. Weighted MSE fusion emerges across statistical signal processing, sensor network estimation, multi-target tracking, and deep learning initialization frameworks, each with context-specific formalism and performance trade-offs.
1. Formal Definitions and Fusion Paradigms
Weighted MSE fusion is defined over two chief paradigms:
- v-fusion (“variable fusion”): The fusion of scalar or vector-valued random estimates $\hat{x}_1, \dots, \hat{x}_n$. Each estimate $\hat{x}_i$ is assigned a nonnegative weight $w_i \ge 0$, subject to $\sum_i w_i = 1$.
- f-fusion (“function fusion”): The fusion of posterior probability densities $f_1(x), \dots, f_n(x)$, with the same convex weighting scheme.
For each paradigm, two fusion rules prevail:
- Arithmetic Average (AA):
  - v-fusion: $\hat{x}_{\mathrm{AA}} = \sum_i w_i \hat{x}_i$
  - f-fusion: $f_{\mathrm{AA}}(x) = \sum_i w_i f_i(x)$
- Geometric Average (GA):
  - v-fusion: $\hat{x}_{\mathrm{GA}} = \prod_i \hat{x}_i^{w_i}$
  - f-fusion: $f_{\mathrm{GA}}(x) = \frac{1}{Z} \prod_i f_i(x)^{w_i}$, where $Z = \int \prod_i f_i(x)^{w_i}\,dx$ ensures normalization
Each fusion rule has a direct relationship to the weighted MSE criterion, particularly in AA, where optimal weights seek to minimize the overall fused MSE (Li et al., 2019).
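The two fusion rules above can be sketched numerically. The following is a minimal illustration (function names `aa_fuse_estimates` and `ga_fuse_densities` are mine, not from the cited work); the GA rule is evaluated on a uniform grid in log-space to avoid underflow, with the normalizer $Z$ approximated by a Riemann sum:

```python
import numpy as np

def aa_fuse_estimates(estimates, weights):
    """v-fusion, arithmetic average: convex combination of point estimates."""
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return np.dot(w, np.asarray(estimates, dtype=float))

def ga_fuse_densities(log_pdfs, weights, grid):
    """f-fusion, geometric average on a uniform grid: normalized weighted
    product of pdfs, computed in log-space for numerical stability.
    log_pdfs may be unnormalized; constants cancel in the normalization."""
    w = np.asarray(weights, dtype=float)
    log_prod = sum(wi * lp for wi, lp in zip(w, log_pdfs))  # sum_i w_i log f_i(x)
    log_prod -= log_prod.max()                              # guard against underflow
    unnorm = np.exp(log_prod)
    Z = unnorm.sum() * (grid[1] - grid[0])                  # Riemann-sum normalizer
    return unnorm / Z
```

For two standard-width Gaussians centered at 0 and 2 with equal weights, the GA-fused density is again Gaussian with mean 1, matching the closed-form Gaussian GA result discussed below in the document.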
2. MSE Analysis and Closed-Form Solutions
For a given “true” parameter $x$, the fused MSE is defined as $\mathrm{MSE}(\hat{x}) = \mathbb{E}\big[\|\hat{x} - x\|^2\big]$, encompassing both variance and bias.
v-fusion (AA):
For two estimates $\hat{x}_1, \hat{x}_2$ with MSEs $M_1, M_2$ and inter-correlation parameter $M_{12} = \mathbb{E}[(\hat{x}_1 - x)(\hat{x}_2 - x)]$: $\mathrm{MSE}_{\mathrm{AA}}(w) = w^2 M_1 + (1-w)^2 M_2 + 2w(1-w)M_{12}$. For unbiased estimates, the formula reduces with $M_{12} = \rho\,\sigma_1\sigma_2$, where $\rho$ is the correlation coefficient and $M_i = \sigma_i^2$.
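This quadratic-in-$w$ MSE has a closed-form minimizer, obtained by setting its derivative to zero. A short sketch (function names are mine) for the two-estimate uncorrelated and correlated cases:

```python
def aa_fused_mse(w, M1, M2, M12):
    """MSE of the AA fusion w*x1 + (1-w)*x2 of two estimators with
    individual MSEs M1, M2 and cross term M12 = E[(x1-x)(x2-x)]."""
    return w**2 * M1 + (1 - w)**2 * M2 + 2 * w * (1 - w) * M12

def optimal_weight(M1, M2, M12):
    """Weight minimizing aa_fused_mse: set d/dw = 0 and solve."""
    return (M2 - M12) / (M1 + M2 - 2 * M12)
```

For example, with $M_1 = 1$, $M_2 = 4$, $M_{12} = 0$, the optimal weight is $w^* = 0.8$ and the fused MSE is $0.8$, below either constituent, illustrating the variance-reduction property of optimally weighted AA fusion.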
v-fusion (GA):
No general closed-form exists for the MSE unless the $\hat{x}_i$ are log-normal; otherwise, analysis involves the covariance structure of $\log \hat{x}_i$ and may require Monte Carlo estimation.
f-fusion (AA):
The fused MSE becomes a simple convex combination: $\mathrm{MSE}_{\mathrm{AA}} = \sum_i w_i\,\mathrm{MSE}_i$ with $\sum_i w_i = 1$, and is bounded: $\min_i \mathrm{MSE}_i \le \mathrm{MSE}_{\mathrm{AA}} \le \max_i \mathrm{MSE}_i$.
f-fusion (GA, Gaussian case):
Fusion of two Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$, $\mathcal{N}(\mu_2, \sigma_2^2)$ with weight $w$ yields another Gaussian: $\sigma_{\mathrm{GA}}^{-2} = w\,\sigma_1^{-2} + (1-w)\,\sigma_2^{-2}$ and $\mu_{\mathrm{GA}} = \sigma_{\mathrm{GA}}^2\big(w\,\mu_1/\sigma_1^2 + (1-w)\,\mu_2/\sigma_2^2\big)$. So $\sigma_{\mathrm{GA}}^2 \le w\,\sigma_1^2 + (1-w)\,\sigma_2^2$ for any $w \in [0,1]$, as shown in (Li et al., 2019).
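The Gaussian GA fusion rule (weighted average of precisions, precision-weighted mean) is a two-line computation. A minimal sketch, with the function name my own:

```python
def ga_fuse_gaussians(mu1, var1, mu2, var2, w):
    """GA (geometric-average) fusion of N(mu1, var1) and N(mu2, var2)
    with weight w on the first component. The result is Gaussian with
    precision equal to the weighted average of the input precisions."""
    prec = w / var1 + (1 - w) / var2          # fused precision
    var = 1.0 / prec
    mu = var * (w * mu1 / var1 + (1 - w) * mu2 / var2)
    return mu, var
```

Fusing $\mathcal{N}(0,1)$ and $\mathcal{N}(2,1)$ with $w = 0.5$ gives $\mathcal{N}(1,1)$; the fused variance never exceeds the convex combination $w\sigma_1^2 + (1-w)\sigma_2^2$, consistent with the inequality above.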
3. Fusion of Weighted Gaussian Mixtures and Practical Representations
In multi-target tracking, Probability Hypothesis Density (PHD) or Cardinalized PHD filters employ weighted Gaussian mixtures (GM):
- AA (GM context): The mixture is fused simply by reweighting and summing all GM components, preserving their structure.
- GA (GM context): The fusion results in a sum of products of Gaussian pairs, which is not itself a GM. Analytic approximations (e.g., ignoring cross-terms) are employed, yielding components with sharper peaks but reduced robustness to missed detections.
A summary of fusion characteristics in PHD-based multi-target tracking:
| Fusion Rule | GM Structure After Fusion | Key Effect |
|---|---|---|
| AA | Retains all GM peaks | Over-dispersed, robust |
| GA | Sharper, fewer peaks | Suppresses false alarms, fragile to missed detections |
Broader tails and retention of spurious components are characteristic of AA; GA yields tighter localization but is subject to peak collapse if any constituent GM is missing a mode (Li et al., 2019).
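The structure-preserving AA rule for Gaussian mixtures amounts to reweighting and concatenating components. A minimal sketch (function name and the `(weight, mean, var)` triple representation are my own conventions; PHD-style mixtures need not have weights summing to one):

```python
def aa_fuse_gaussian_mixtures(mixtures, fusion_weights):
    """AA fusion of weighted Gaussian mixtures: scale each component's
    weight by its mixture's fusion weight and concatenate. The GM
    structure is preserved exactly, with no approximation.
    Each mixture is a list of (component_weight, mean, var) triples."""
    fused = []
    for omega, gm in zip(fusion_weights, mixtures):
        for (pi, mu, var) in gm:
            fused.append((omega * pi, mu, var))
    return fused
```

Note the fused mixture keeps every peak of every constituent, which is exactly the over-dispersion/robustness trade-off in the table above; GA fusion of mixtures has no such exact closed form and requires the approximations described in the text.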
4. Extension to Neural Network Layer Fusion: MSE-Optimal Layer Fusion
Weighted MSE fusion also underlies algorithms for neural network initialization through layer fusion. For two sequential layers in a deep net, a single equivalent layer is sought that minimizes the expected squared norm of the difference from the original two-layer mapping.
Let $\bma_0$ denote the input, $\bma_1 = H(\bma_0)$ the output of the first layer, and $\bma_2 = f(\bW_2 \bma_1 + \bmb_2)$ the output of the second. The goal is to find parameters $(\bW_f, \bmb_f)$ that minimize
$L(\bW_f, \bmb_f) = \mathbb{E}_{\bma_0} \| \bW_f\bma_0 + \bmb_f - (\bW_2 H(\bma_0) + \bmb_2)\|_{\bSigma}^2$
where $\|\cdot\|_{\bSigma}^2$ is a (possibly weighted) squared Mahalanobis norm.
The unique minimizer is: $\bW_f^* = \bW_2 \bC_{10}\bC_{00}^{-1}$
$\bmb_f^* = \bW_2 \mu_1 + \bmb_2 - \bW_f^* \mu_0$
where $\mu_0 = \mathbb{E}[\bma_0]$, $\mu_1 = \mathbb{E}[\bma_1]$, and the covariances $\bC_{00} = \mathrm{Cov}(\bma_0)$, $\bC_{10} = \mathrm{Cov}(\bma_1, \bma_0)$ are taken over the empirical distribution of $\bma_0$ and $\bma_1$ (Ghods et al., 2020).
Fusing more than two layers generalizes by regarding their cumulative mapping as a single function of the input and applying the same closed-form formulas.
The 'FuseInit' method proceeds by successively fusing layer pairs in deep networks, initializing shallower networks at weighted-MSE-optimal points, followed by fine-tuning.
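The closed-form minimizer above can be computed directly from empirical activations. The following is a sketch under my own naming conventions (`fuse_layers`, sample matrices `A0`, `A1` with one sample per row), not the authors' reference implementation:

```python
import numpy as np

def fuse_layers(A0, A1, W2, b2):
    """MSE-optimal single-layer replacement for a0 -> a1 = H(a0) -> W2 a1 + b2.
    A0: (n, d0) samples of the layer input a0; A1: (n, d1) samples of a1.
    Returns (Wf, bf) minimizing E || Wf a0 + bf - (W2 a1 + b2) ||^2."""
    mu0, mu1 = A0.mean(axis=0), A1.mean(axis=0)
    A0c, A1c = A0 - mu0, A1 - mu1
    n = A0.shape[0]
    C00 = A0c.T @ A0c / n                  # empirical Cov(a0)
    C10 = A1c.T @ A0c / n                  # empirical Cov(a1, a0)
    Wf = W2 @ C10 @ np.linalg.pinv(C00)    # Wf* = W2 C10 C00^{-1}
    bf = W2 @ mu1 + b2 - Wf @ mu0          # bf* = W2 mu1 + b2 - Wf* mu0
    return Wf, bf
```

As a sanity check: when the first layer $H$ is itself affine, the fused layer reproduces the two-layer mapping exactly, since the regression of an affine function of the input onto the input is lossless.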
5. Optimal Weight Selection and Fusion Strategy
Weight selection in AA fusion rules is governed by minimization of the resulting MSE. For unbiased and uncorrelated estimates, the classical Millman (inverse-variance) weighting applies: $w_i = \sigma_i^{-2} / \sum_j \sigma_j^{-2}$. For correlated cases, the optimal weights minimize $\mathbf{w}^{\top} \bC\, \mathbf{w}$ under $\mathbf{1}^{\top} \mathbf{w} = 1$, where $\bC$ is the joint error covariance matrix.
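The constrained minimizer of the quadratic form above has the standard closed form $\mathbf{w}^* = \bC^{-1}\mathbf{1} / (\mathbf{1}^{\top}\bC^{-1}\mathbf{1})$ (via a Lagrange multiplier on the sum constraint). A minimal sketch, with the function name my own:

```python
import numpy as np

def optimal_fusion_weights(C):
    """Minimizer of w^T C w subject to sum(w) = 1:
    w* = C^{-1} 1 / (1^T C^{-1} 1). For diagonal C this reduces to
    inverse-variance (Millman) weighting; for correlated errors some
    weights may legitimately be negative."""
    ones = np.ones(C.shape[0])
    x = np.linalg.solve(C, ones)   # C^{-1} 1 without forming the inverse
    return x / x.sum()
```

With uncorrelated variances $(1, 4)$ this returns weights $(0.8, 0.2)$, recovering the inverse-variance rule from the text.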
If covariance or cross-correlation is unknown, uniform weights deliver robust baseline performance; the AA fused MSE does not exceed that of the least accurate constituent (Li et al., 2019). In f-fusion for Gaussian pdfs, GA fusion with weights set proportional to the inverse variance is optimal under the exact known covariance scenario (Li et al., 2019).
6. Comparative Analysis: Performance, Robustness, and Trade-offs
Fusion rule selection is context-dependent:
- v-fusion: The AA rule can in principle reach lower variance than any constituent if correlation is low or negative and weights are tuned optimally. The GA rule generally cannot match this at any weight $w$.
- f-fusion (Gaussian case): The GA rule gives consistently lower or equal variance (and thus MSE) than AA for any $w \in [0,1]$. GA is best for precise localization if one can tolerate modes being destroyed by missing data; AA is more robust to such data defects.
- Gaussian Mixtures: AA preserves all mixture components, which can result in over-dispersion or false alarm retention. GA fusion provides sharper estimates but is highly sensitive to missed constituents.
- Neural Network Fusion: MSE-optimal layer fusion yields a closed-form optimal initialization; successive application enables shallow networks to inherit the performance profile of deeper pre-trained nets, with rapid retraining convergence (Ghods et al., 2020).
7. Bayesian Monte Carlo Approaches to MSE-Optimal Fusion
In scenarios where cross-correlation parameters are unknown, a Bayesian framework can estimate the MSE-optimal fusion weights. By assigning a prior to the joint error covariance and exploiting the conditional distribution of its off-diagonal blocks (an inverted matrix-variate $t$-distribution), samples of the unknown covariance are drawn:
- For each sample, construct the optimal linear fusion weights according to the conditional structure,
- Fuse the estimates accordingly,
- Average over samples for final MMSE fusion statistics.
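The three steps above can be sketched in a deliberately simplified scalar form. This is not the paper's algorithm: the unknown cross-correlation is drawn here from a uniform prior standing in for the inverted matrix-variate $t$ conditional, and the function name is mine:

```python
import numpy as np

def bayesian_mc_fuse(x1, x2, var1, var2, rng, n_samples=1000):
    """Simplified Monte Carlo fusion of two scalar estimates whose
    cross-correlation rho is unknown: draw rho from a prior (uniform
    here, as a stand-in for the paper's matrix-variate conditional),
    form the MSE-optimal weight per draw, and average the fused values."""
    fused = np.empty(n_samples)
    for k in range(n_samples):
        rho = rng.uniform(-0.9, 0.9)                 # prior draw of unknown correlation
        c12 = rho * np.sqrt(var1 * var2)             # implied cross term
        w = (var2 - c12) / (var1 + var2 - 2 * c12)   # optimal weight for this draw
        fused[k] = w * x1 + (1 - w) * x2
    return fused.mean()
```

When the two variances are equal, every draw yields $w = 1/2$ regardless of the sampled correlation, so the fused value collapses to the plain average, which is a useful sanity check on the weight formula.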
This approach outperforms covariance intersection—especially as the number of input nodes increases—achieving 10–20% lower MSE in simulation across multiple SNR regimes (Weng et al., 2013).
Through the diversity of AA/GA rules, Gaussian-mixture frameworks, and neural network layer fusion, weighted MSE fusion methods provide a unified mathematical foundation for the optimal aggregation of uncertain information: tailored via convex weighting, amenable to precise theoretical characterization, and adaptable to a wide array of practical signal processing and learning architectures.