Source Attribution Percentage Matrix
- The Source Attribution Percentage Matrix is a quantitative framework that partitions multivariate pollutant data into source-specific percentage contributions using geometric estimators.
- It addresses classical NMF limitations by ensuring scale invariance and relaxing strict purity and sparsity assumptions through convex hull geometry.
- Numerical simulations confirm its consistency and accuracy in source recovery, making it a practical tool for air pollution control and regulatory assessment.
The Source Attribution Percentage Matrix is a quantitative framework devised for attributing observed multivariate signals—such as concentrations of air pollutants—to their contributing sources, resolving two major classical limitations in non-negative matrix factorization (NMF): non-uniqueness and the need for restrictive assumptions. By defining the estimand at the population level and constructing consistent geometric estimators, the matrix enables robust, interpretable partitioning of observed concentrations into percentages attributable to each source, while ensuring invariance to rescaling and relaxing classical NMF requirements (Jin et al., 4 Oct 2025).
1. Formal Definition and Identifiability
Let $X \in \mathbb{R}_{\ge 0}^{n \times p}$ be the observed data matrix, with $n$ samples (e.g., time points) and $p$ features (e.g., pollutant concentrations). The standard NMF representation models $X = WH$, where $W \in \mathbb{R}_{\ge 0}^{n \times K}$ encodes the sample-level contributions (emissions) from $K$ sources and $H \in \mathbb{R}_{\ge 0}^{K \times p}$ encodes the per-unit source profiles across pollutants.
The central construct, the Source Attribution Percentage Matrix $\Phi$, is defined elementwise as $\Phi_{kj} = \mu_k H_{kj} / \sum_{l=1}^{K} \mu_l H_{lj}$, where $\mu_k = \mathbb{E}[W_{ik}]$ is the expected emission from source $k$. Thus, each $\Phi_{kj}$ represents the population-level fraction of the concentration of pollutant $j$ attributable to source $k$.
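As a minimal sketch of this definition, the plug-in version below replaces the expected emissions with sample means. It assumes factor matrices `W` (samples by sources) and `H` (sources by pollutants) are already available; the function name `attribution_matrix` and the toy numbers are illustrative and not taken from the source.

```python
import numpy as np

def attribution_matrix(W, H):
    """Plug-in attribution percentages from W (n x K) and H (K x p).

    Assumes every pollutant has a strictly positive mean concentration,
    so the column sums used for normalization are nonzero.
    """
    W = np.asarray(W, dtype=float)
    H = np.asarray(H, dtype=float)
    mu = W.mean(axis=0)            # sample mean emission per source, shape (K,)
    contrib = mu[:, None] * H      # mean contribution of source k to pollutant j
    return contrib / contrib.sum(axis=0, keepdims=True)  # each column sums to 1

# Toy example with 2 sources and 3 pollutants (made-up numbers).
W = np.array([[1.0, 3.0],
              [3.0, 1.0]])
H = np.array([[2.0, 0.0, 1.0],
              [2.0, 4.0, 3.0]])
Phi = attribution_matrix(W, H)
print(Phi)
```

Each column of the result sums to one, so column `j` reads directly as the split of pollutant `j` across sources.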
Key identifiability results show that $\Phi$ is uniquely defined under two conditions:
- The emission process is stationary and ergodic (allowing empirical estimates of the mean emissions $\mu_k$ via sample averaging).
- The emission distribution is probabilistically separable: it puts positive probability on regions of the emission simplex close to each canonical direction, so near-pure source patterns can be observed without requiring strict 'pure pixel' or sparsity constraints.
The matrix $\Phi$ remains invariant under arbitrary positive diagonal rescalings of $W$ and $H$, unlike the factors themselves. This property is critical for the interpretation and comparison of source contributions across studies and measurement units.
2. Geometric Estimation Methodology
The estimation procedure exploits the conical geometry of the data induced by the NMF model:
- Row-normalize each sample: $\tilde{x}_i = x_i / \|x_i\|_1$, where $\|x_i\|_1 = \sum_{j=1}^{p} x_{ij}$, to represent each sample in the canonical simplex.
- The normalized source profiles form the vertices of the convex polytope governing the data distribution.
- The convex hull of the normalized data approximates this polytope, and the method estimates the vertices by finding the $K$ points within the convex hull that maximize the $(K-1)$-dimensional volume of the simplex they span.
- Once the profile estimate $\hat{H}$ is determined, the estimator $\hat{\Phi}$ is constructed from the sample means of the recovered emissions $\hat{W}$, which approach the population means as the sample size grows.
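The steps above can be sketched roughly as follows. This is not the source's algorithm: it swaps in a brute-force search over data points for the maximum-volume step (feasible only for small samples) and uses clipped least squares in place of a proper nonnegative solver. The names `simplex_volume_sq` and `estimate_attribution` are hypothetical, and the toy data include exactly pure samples for simplicity, whereas the framework only needs near-pure ones.

```python
import numpy as np
from itertools import combinations

def simplex_volume_sq(V):
    """Gram determinant of edge vectors; proportional to squared simplex volume."""
    E = V[1:] - V[0]                       # (K-1) edge vectors from the first vertex
    return np.linalg.det(E @ E.T)

def estimate_attribution(X, K):
    X = np.asarray(X, dtype=float)
    row_sums = X.sum(axis=1, keepdims=True)
    Xn = X / row_sums                      # step 1: row-normalize onto the simplex
    # Steps 2-3 (naive stand-in): choose the K samples spanning the largest simplex.
    best = max(combinations(range(len(Xn)), K),
               key=lambda idx: simplex_volume_sq(Xn[list(idx)]))
    V = Xn[list(best)]                     # estimated normalized profiles, K x p
    # Mixing weights of each normalized sample over the vertices
    # (least squares, then clipped at zero as a crude nonnegativity fix).
    A = np.linalg.lstsq(V.T, Xn.T, rcond=None)[0].T
    A = np.clip(A, 0.0, None)
    W_hat = A * row_sums                   # undo the row normalization
    mu_hat = W_hat.mean(axis=0)            # step 4: sample-mean emissions
    contrib = mu_hat[:, None] * V
    return contrib / contrib.sum(axis=0, keepdims=True), V

# Toy data: 3 sources, 4 pollutants, 40 samples (made-up values).
rng = np.random.default_rng(0)
H_true = np.array([[0.6, 0.2, 0.1, 0.1],
                   [0.1, 0.5, 0.3, 0.1],
                   [0.1, 0.1, 0.2, 0.6]])
W_true = np.vstack([np.diag([1.0, 2.0, 0.5]),            # three pure samples
                    rng.uniform(0.1, 1.0, size=(37, 3))])  # mixed samples
X = W_true @ H_true
Phi_hat, V_hat = estimate_attribution(X, K=3)
print(np.round(Phi_hat, 3))
```

The combinatorial search costs on the order of "n choose K" determinant evaluations, so a practical implementation would restrict candidates to extreme points of the convex hull or use a dedicated vertex-finding routine.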
The estimator's consistency (convergence of $\hat{\Phi}$ to the true $\Phi$) holds under both independent and dependent emission processes (e.g., AR(1) models). The Hausdorff distance between the empirical convex hull and the true polytope vanishes as the sample size $n$ grows.
3. Mathematical Properties and Scale Invariance
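Unlike the classical NMF factors, which are only defined up to rescaling by positive diagonal matrices, the Source Attribution Percentage Matrix $\Phi$ is scale (and unit) invariant. For any positive diagonal matrix $D$, the transformation $(W, H) \mapsto (W D^{-1}, D H)$ leaves the product $WH$ unchanged, yet changes $W$ and $H$ arbitrarily. In contrast, the ratio
$$\Phi_{kj} = \frac{\mu_k H_{kj}}{\sum_{l=1}^{K} \mu_l H_{lj}}$$
(with $\mu_k$ the expected emission from source $k$) remains unchanged: each product $\mu_l H_{lj}$ is unaffected by $D$, and a change of units for pollutant $j$ multiplies numerator and denominator by the same factor. The attribution percentages are therefore stable and comparable even between studies with differing normalization or instrument calibrations.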
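A quick numerical check of this invariance, as a sketch on random toy factors: the helper `attribution_matrix` and all values below are illustrative assumptions rather than anything specified in the source.

```python
import numpy as np

def attribution_matrix(W, H):
    """Plug-in attribution percentages using sample-mean emissions."""
    mu = W.mean(axis=0)
    contrib = mu[:, None] * H
    return contrib / contrib.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
W = rng.uniform(0.1, 2.0, size=(50, 3))    # 50 samples, 3 sources
H = rng.uniform(0.1, 1.0, size=(3, 4))     # 3 sources, 4 pollutants

# Source-wise rescaling: W D^{-1} and D H give the same data matrix.
d = np.array([0.5, 3.0, 10.0])
W2, H2 = W / d, d[:, None] * H
same_data = np.allclose(W @ H, W2 @ H2)
factors_differ = not np.allclose(H, H2)
same_phi = np.allclose(attribution_matrix(W, H), attribution_matrix(W2, H2))

# Unit change per pollutant: each column of H (and of the data) is rescaled.
c = np.array([1.0, 1000.0, 0.001, 2.0])
same_phi_units = np.allclose(attribution_matrix(W, H), attribution_matrix(W, H * c))

print(same_data, factors_differ, same_phi, same_phi_units)
```

The factors change under both transformations while the attribution matrix does not, which is what makes it comparable across normalizations.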
4. Numerical Validation and Convergence
Simulation experiments, spanning both independent identically distributed and serially dependent emissions, confirm the geometric estimator's performance:
- Normalized root mean squared error (NRMSE) and Frobenius norm distances between the true and estimated matrices decrease with increasing sample size $n$, supporting consistency.
- Scatter plots of estimated vs. true entries of $\Phi$ exhibit points tightly clustered near the diagonal for moderate-to-large $n$.
- The maximum volume polytope algorithm robustly locates source vertices in the empirical simplex, leading to reliable source profile recovery and, via averaging, accurate estimation of $\Phi$.
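The error metrics can be illustrated with a reduced simulation. The sketch below is not the source's experiment: it treats the source profiles as known and only tests the averaging step under serially dependent emissions. The AR(1) construction with uniform innovations, the range-based NRMSE normalization, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def attribution_matrix(mu, H):
    contrib = mu[:, None] * H
    return contrib / contrib.sum(axis=0, keepdims=True)

def nrmse(est, true):
    """RMSE divided by the range of the true entries (one common convention)."""
    rmse = np.sqrt(np.mean((est - true) ** 2))
    return rmse / (true.max() - true.min())

def simulate_ar1_emissions(n, scale, rho, rng):
    """Nonnegative AR(1): W_t = rho * W_{t-1} + e_t with e_t ~ Uniform(0, scale)."""
    W = np.empty((n, len(scale)))
    W[0] = rng.uniform(0.0, scale)
    for t in range(1, n):
        W[t] = rho * W[t - 1] + rng.uniform(0.0, scale)
    return W

rng = np.random.default_rng(0)
H = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.1, 0.5, 0.3, 0.1],
              [0.1, 0.1, 0.2, 0.6]])
scale = np.array([1.0, 2.0, 0.5])
rho = 0.5
mu_true = (scale / 2.0) / (1.0 - rho)      # stationary mean of the AR(1) process
Phi_true = attribution_matrix(mu_true, H)

errors = {}
for n in (200, 2000, 20000):
    W = simulate_ar1_emissions(n, scale, rho, rng)
    Phi_hat = attribution_matrix(W.mean(axis=0), H)
    errors[n] = (nrmse(Phi_hat, Phi_true),
                 np.linalg.norm(Phi_hat - Phi_true, "fro"))

for n, (e_nrmse, e_fro) in errors.items():
    print(n, e_nrmse, e_fro)
```

Both error measures would be expected to shrink roughly like one over the square root of `n` here, since the only estimated quantity is a sample mean of a stationary process.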
5. Practical Application and Interpretability
Practitioners obtain an immediately interpretable matrix quantifying the proportion of each observed feature (pollutant) attributable to each source. This representation:
- Enables targeted mitigation policies by identifying dominant sources for each pollutant.
- Facilitates cross-study and cross-site comparability even when measurement units or experimental designs differ.
- Avoids pitfalls associated with arbitrary sparsity or normalization assumptions in traditional NMF, increasing robustness to model specification.
- Informs regulatory frameworks and reporting standards for source apportionment, since the matrix provides actionable, scale-free attribution percentages.
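To show how such a matrix might be read in practice, the sketch below picks the dominant source for each pollutant. The source labels, pollutant names, and percentages are entirely hypothetical and serve only to illustrate the lookup.

```python
import numpy as np

# Hypothetical labels and attribution percentages (each column sums to 1).
sources = ["traffic", "industry", "biomass burning"]
pollutants = ["NO2", "SO2", "PM2.5", "CO"]
Phi = np.array([[0.70, 0.10, 0.30, 0.55],
                [0.20, 0.85, 0.25, 0.10],
                [0.10, 0.05, 0.45, 0.35]])

# Dominant source per pollutant: row index of the largest entry in each column.
top = Phi.argmax(axis=0)
dominant = {pollutants[j]: sources[k] for j, k in enumerate(top)}

for j, name in enumerate(pollutants):
    print(f"{name}: {dominant[name]} ({Phi[top[j], j]:.0%})")
```

Because the entries are percentages rather than raw factor loadings, the same lookup can be applied to matrices estimated at different sites or in different units.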
6. Broader Implications and Extensions
The population-level, geometry-based identification strategy avoids a core limitation of conventional factor interpretation, where factors are not uniquely defined and are often entangled with normalization choices. By focusing on convex hull geometry and probabilistic separability, the framework applies to a broad class of problems: not only atmospheric pollution, but any scenario requiring attribution of multivariate measurements to latent sources via NMF. The geometric estimator requires minimal parametric assumptions and accommodates temporal or spatial dependence in source emissions, further increasing its real-world utility.
This approach formalizes a rigorous, interpretable, and robust methodology for source attribution analysis and defines a new standard for reporting and comparing source contributions in multi-feature observational studies.