Source Attribution Percentage Matrix

Updated 8 October 2025

The Source Attribution Percentage Matrix is a quantitative framework that partitions multivariate pollutant data into source-specific percentage contributions using geometric estimators.
It addresses classical NMF limitations by ensuring scale invariance and relaxing strict purity and sparsity assumptions through convex hull geometry.
Numerical simulations confirm its consistency and accuracy in source recovery, making it a practical tool for air pollution control and regulatory assessment.

The Source Attribution Percentage Matrix is a quantitative framework for attributing observed multivariate signals, such as air pollutant concentrations, to their contributing sources. It addresses two classical limitations of non-negative matrix factorization (NMF): non-uniqueness of the factors and the need for restrictive assumptions. By defining the estimand at the population level and constructing consistent geometric estimators, the framework partitions observed concentrations into percentages attributable to each source in a robust, interpretable way, while remaining invariant to rescaling and relaxing classical NMF requirements (Jin et al., 4 Oct 2025).

1. Formal Definition and Identifiability

Let $Y \in \mathbb{R}_+^{n \times J}$ be the observed data matrix, with $n$ samples (e.g., time points) and $J$ features (e.g., pollutant concentrations). The standard NMF representation models $Y = W H$, where $W \in \mathbb{R}_+^{n \times K}$ encodes the sample-level contributions (emissions) from $K$ sources and $H \in \mathbb{R}_+^{K \times J}$ encodes the per-unit source profiles across pollutants.

The central construct, the Source Attribution Percentage Matrix $\Phi$, is defined elementwise as $\phi_{kj} = \frac{\mu_k H_{kj}}{\sum_\ell \mu_\ell H_{\ell j}}$, where $\mu_k = E(W_{ik})$ is the expected emission from source $k$. Thus, each $\phi_{kj}$ represents the population-level fraction of the concentration of pollutant $j$ attributable to source $k$, and each column of $\Phi$ sums to one.
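As a small illustration of the definition, the sketch below evaluates $\Phi$ from a vector of mean emissions and a profile matrix. The function name `attribution_matrix` and the toy values of `mu` and `H` are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

def attribution_matrix(mu, H):
    """Return Phi with phi[k, j] = mu[k] * H[k, j] / sum_l mu[l] * H[l, j]."""
    mu = np.asarray(mu, dtype=float)
    H = np.asarray(H, dtype=float)
    weighted = mu[:, None] * H  # K x J: mean contribution of source k to pollutant j
    return weighted / weighted.sum(axis=0, keepdims=True)

# Hypothetical toy values: K = 2 sources, J = 3 pollutants.
mu = np.array([2.0, 1.0])
H = np.array([[1.0, 0.0, 3.0],
              [2.0, 4.0, 0.0]])

Phi = attribution_matrix(mu, H)
print(Phi)
```

Here the first pollutant is split evenly between the two sources (2·1 versus 1·2), while the second and third pollutants are each attributed entirely to a single source.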

Key identifiability results show that $\Phi$ is uniquely defined under two conditions:

The emission process $\{W_i\}$ is stationary and ergodic (allowing empirical estimates of $\mu_k$ via sample averaging).
The emission distribution is probabilistically separable, i.e., it puts positive probability on regions of the emission simplex close to each canonical direction. Near-pure source patterns can therefore be observed without requiring strict 'pure pixel' or sparsity constraints.

The matrix $\Phi$ remains invariant under arbitrary diagonal rescalings of $W$ and $H$, unlike the factors themselves. This property is critical for the interpretation and comparison of source contributions across studies and measurement units.

2. Geometric Estimation Methodology

The estimation procedure exploits the conical geometry of the data induced by the NMF model:

Row-normalize each sample: $Y^*_i = Y_i / r_i$, where $r_i = \sum_j Y_{ij}$, so that each sample lies in the canonical simplex.
The normalized source profiles $h_k^* = h_k / \sum_j H_{kj}$, where $h_k$ denotes the $k$-th row of $H$, form the vertices of the convex polytope that contains the normalized data.
The convex hull of the normalized data approximates this polytope, and the procedure estimates the $K$ vertices by finding the $K$ points in the convex hull that span the simplex of maximal $(K-1)$-dimensional volume.
Once $H^*$ is determined, the estimator for $\Phi$ is constructed using the sample means of $W_i$, which approximate the population means $\mu_k$ for sufficiently large samples.
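The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `simplex_volume_sq` and `estimate_phi` are hypothetical, the vertex search is an exhaustive scan over all $K$-subsets of samples (feasible only for small $n$; the maximizer always sits at hull vertices, so scanning every point gives the same answer as scanning the hull), and the emissions are recovered by least squares against the estimated profiles, a step the text does not spell out. The toy data deliberately include one pure sample per source so that the vertices are recovered exactly.

```python
from itertools import combinations
import numpy as np

def simplex_volume_sq(V):
    """Gram determinant of the edge vectors; proportional to the squared simplex volume."""
    D = V[1:] - V[0]               # (K-1) x J edges from the first vertex
    return np.linalg.det(D @ D.T)

def estimate_phi(Y, K):
    """Return (Phi_hat, H_star) from an n x J non-negative data matrix Y."""
    Y = np.asarray(Y, dtype=float)
    # Step 1: row-normalize so every sample lies on the probability simplex.
    Y_star = Y / Y.sum(axis=1, keepdims=True)
    # Steps 2-3: pick the K samples spanning the largest simplex (brute force).
    best = max(combinations(range(len(Y_star)), K),
               key=lambda idx: simplex_volume_sq(Y_star[list(idx)]))
    H_star = Y_star[list(best)]    # K x J estimated normalized profiles
    # Step 4 (assumed): solve Y = W_tilde @ H_star by least squares, then average.
    W_tilde = np.linalg.lstsq(H_star.T, Y.T, rcond=None)[0].T   # n x K
    mu_tilde = W_tilde.mean(axis=0)
    weighted = mu_tilde[:, None] * H_star
    return weighted / weighted.sum(axis=0, keepdims=True), H_star

# Hypothetical toy data: K = 3 sources, J = 4 pollutants.
rng = np.random.default_rng(0)
H = np.array([[5.0, 1.0, 0.0, 2.0],
              [0.0, 4.0, 3.0, 1.0],
              [1.0, 0.0, 2.0, 6.0]])
W_pure = np.diag([2.0, 1.5, 3.0])                       # one pure sample per source
W_mix = rng.dirichlet([1.0, 1.0, 1.0], size=30) * rng.uniform(0.5, 3.0, size=(30, 1))
W = np.vstack([W_pure, W_mix])
Y = W @ H

Phi_hat, H_star = estimate_phi(Y, K=3)
print(np.round(Phi_hat, 3))
```

In general the recovered sources are identified only up to a permutation of the rows of `H_star`; here the pure samples occupy the first three rows of `Y`, so the ordering matches that of `H`. For realistic sample sizes the exhaustive search would need to be replaced by a search restricted to convex hull vertices or by a dedicated maximum-volume routine.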

The estimator's consistency (convergence to the true $\Phi$) holds under both independent and dependent emission processes (e.g., AR(1) models). The Hausdorff distance between the empirical convex hull and the true polytope vanishes as sample size grows.

3. Mathematical Properties and Scale Invariance

Unlike the classical NMF factors, which are defined only up to rescaling by positive diagonal matrices, the Source Attribution Percentage Matrix is scale (and unit) invariant. For any positive diagonal matrix $D$, the identity $WH = (WD)(D^{-1}H)$ leaves $Y$ unchanged yet changes $W$ and $H$ arbitrarily. In contrast, the ratio

$\phi_{kj} = \frac{ \mu_k H_{kj} }{ \sum_\ell \mu_\ell H_{\ell j} }$

remains unchanged. Under the rescaling, $\mu_k$ is multiplied by the $k$-th diagonal entry $d_k$ of $D$ while $H_{kj}$ is divided by it, so every product $\mu_k H_{kj}$ is preserved; likewise, a change of measurement unit for pollutant $j$ multiplies the numerator and denominator by the same constant. The attribution percentages are therefore stable and comparable even between studies with differing normalization or instrument calibrations.
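The invariance can be checked numerically. The sketch below, with hypothetical random factors and an arbitrary diagonal `d`, rescales $W$ and $H$ in opposite directions and also changes the measurement unit of each pollutant; in both cases the attribution matrix computed from the sample means is the same. The helper `attribution_matrix` is an illustrative name, not part of any library.

```python
import numpy as np

def attribution_matrix(mu, H):
    """phi[k, j] = mu[k] * H[k, j] / sum_l mu[l] * H[l, j]."""
    weighted = np.asarray(mu, dtype=float)[:, None] * np.asarray(H, dtype=float)
    return weighted / weighted.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
W = rng.gamma(2.0, 1.0, size=(500, 3))      # hypothetical emissions, n = 500, K = 3
H = rng.uniform(0.1, 1.0, size=(3, 5))      # hypothetical profiles, J = 5

# Diagonal rescaling: W -> W D, H -> D^{-1} H leaves Y = W H unchanged.
d = np.array([0.2, 7.0, 1000.0])
W_scaled = W * d
H_scaled = H / d[:, None]
same_data = np.allclose(W @ H, W_scaled @ H_scaled)

# Unit change: each pollutant column of Y (hence of H) is multiplied by a constant.
c = np.array([1.0, 1e3, 1e-3, 2.5, 40.0])
H_units = H * c

Phi = attribution_matrix(W.mean(axis=0), H)
Phi_scaled = attribution_matrix(W_scaled.mean(axis=0), H_scaled)
Phi_units = attribution_matrix(W.mean(axis=0), H_units)

print(same_data, np.allclose(Phi, Phi_scaled), np.allclose(Phi, Phi_units))
```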

4. Numerical Validation and Convergence

Simulation experiments, spanning both independent identically distributed and serially dependent $W_i$ emissions, confirm the geometric estimator's performance:

Normalized root mean squared error (NRMSE) and Frobenius norm distances between true and estimated $\Phi$ matrices decrease with increasing sample size $n$, supporting consistency.
Scatter plots of estimated vs. true $\phi_{kj}$ entries exhibit points tightly clustered near the diagonal for moderate-to-large $n$.
The maximum volume polytope algorithm robustly locates source vertices in the empirical simplex, leading to reliable source profile recovery and, via averaging, accurate estimation of $\Phi$.
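The two error measures mentioned above can be computed as in the following sketch. The text does not state how the NRMSE is normalized, so dividing the root mean squared error by the range of the true entries is an assumption made here for illustration; the function names and the toy matrices are likewise hypothetical.

```python
import numpy as np

def frobenius_distance(Phi_true, Phi_hat):
    """Frobenius norm of the elementwise difference."""
    return np.linalg.norm(np.asarray(Phi_true) - np.asarray(Phi_hat), ord="fro")

def nrmse(Phi_true, Phi_hat):
    """RMSE divided by the range of the true entries (assumed normalization)."""
    Phi_true = np.asarray(Phi_true, dtype=float)
    Phi_hat = np.asarray(Phi_hat, dtype=float)
    rmse = np.sqrt(np.mean((Phi_true - Phi_hat) ** 2))
    return rmse / (Phi_true.max() - Phi_true.min())

# Hypothetical true and estimated matrices: every entry is off by 0.1.
Phi_true = np.array([[0.5, 0.0, 1.0],
                     [0.5, 1.0, 0.0]])
Phi_hat = np.array([[0.6, 0.1, 0.9],
                    [0.4, 0.9, 0.1]])

frob = frobenius_distance(Phi_true, Phi_hat)
err = nrmse(Phi_true, Phi_hat)
print(frob, err)
```

When comparing against a true $\Phi$ in a simulation, the rows of the estimate may first need to be permuted to match the ordering of the true sources.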

5. Practical Application and Interpretability

Practitioners obtain an immediately interpretable matrix $\Phi$ quantifying the proportion of each observed feature (pollutant) attributable to each source. This representation:

Enables targeted mitigation policies by identifying dominant sources for each pollutant.
Facilitates cross-study and cross-site comparability even when measurement units or experimental designs differ.
Avoids pitfalls associated with arbitrary sparsity or normalization assumptions in traditional NMF, increasing robustness to model specification.
Informs regulatory frameworks and reporting standards for source apportionment, since the matrix provides actionable, scale-free attribution percentages.
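As an example of how such a matrix might be read in practice, the sketch below reports the dominant source for each pollutant. The source names, pollutant names, and the entries of `Phi` are invented for illustration and do not come from any dataset.

```python
import numpy as np

# Hypothetical attribution matrix: rows are sources, columns are pollutants.
sources = ["traffic", "industry"]
pollutants = ["PM2.5", "NO2", "SO2"]
Phi = np.array([[0.70, 0.20, 0.55],
                [0.30, 0.80, 0.45]])

# For each pollutant (column), find the source with the largest share.
dominant_idx = Phi.argmax(axis=0)
dominant = {p: (sources[k], float(Phi[k, j]))
            for j, (p, k) in enumerate(zip(pollutants, dominant_idx))}

for pollutant, (source, share) in dominant.items():
    print(f"{pollutant}: {source} ({share:.0%})")
```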

6. Broader Implications and Extensions

The population-level, geometry-based identification strategy overcomes the limitations of classical factorization, where factors are not uniquely defined and are often entangled with normalization choices. By focusing on convex hull geometry and probabilistic separability, the framework applies to a broad class of problems: not only atmospheric pollution, but any scenario requiring attribution of multivariate measurements to latent sources via NMF. The geometric estimator requires minimal parametric assumptions and accommodates temporal or spatial dependence in source emissions, further increasing its real-world utility.

This approach formalizes a rigorous, interpretable, and robust methodology for source attribution analysis and defines a new standard for reporting and comparing source contributions in multi-feature observational studies.

References

Identification in source apportionment using geometry (2025)
