Nonnegative Matrix Factorization (NMF)
- Nonnegative Matrix Factorization is a technique that decomposes nonnegative data matrices into lower-rank, additive factors, revealing semantically meaningful components.
- It utilizes loss functions such as the Frobenius norm and KL divergence, together with predictability criteria based on the principle of the common cause (PCC) for robust, noise-resistant rank estimation.
- NMF supports applications in clustering, denoising, and dimensionality reduction by producing stable, sparse, and interpretable parts-based representations.
Nonnegative Matrix Factorization (NMF) refers to a class of algorithms that, given a data matrix with nonnegative entries, seek to approximate it as the product of two or more lower-rank nonnegative matrices. The underlying optimization problem is nonconvex and NP-hard in general, yet NMF is widely employed in unsupervised learning, parts-based data representation, clustering, denoising, and dimensionality reduction. NMF’s interpretability arises from its nonnegativity constraints, which facilitate the decomposition of data into additive, often sparse and semantically meaningful components. Recent research has linked NMF to foundational concepts in probability and causality, such as the principle of the common cause, leading to advances in rank selection, stability analysis, clustering, and denoising (Khalafyan et al., 3 Sep 2025).
1. Formal Framework and Loss Functions
Given a nonnegative matrix $X \in \mathbb{R}_{\geq 0}^{m \times n}$, standard NMF seeks matrices $W \in \mathbb{R}_{\geq 0}^{m \times r}$ and $H \in \mathbb{R}_{\geq 0}^{r \times n}$, with small inner dimension $r \ll \min(m, n)$, such that

$$X \approx WH,$$

or, elementwise, $X_{ij} \approx \sum_{k=1}^{r} W_{ik} H_{kj}$.
Standard objectives for fitting this decomposition include:
- Frobenius norm minimization: $\min_{W, H \geq 0} \|X - WH\|_F^2 = \sum_{ij} \big(X_{ij} - (WH)_{ij}\big)^2$
- Kullback–Leibler (KL) divergence: $\min_{W, H \geq 0} D_{\mathrm{KL}}(X \,\|\, WH) = \sum_{ij} \Big( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \Big)$
The KL objective, at its local minima, guarantees marginal conservation constraints: $\sum_i (WH)_{ij} = \sum_i X_{ij}$ for every $j$, and $\sum_j (WH)_{ij} = \sum_j X_{ij}$ for every $i$. No additional regularization terms were used in (Khalafyan et al., 3 Sep 2025); alternative formulations may include explicit sparsity or smoothness penalties.
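As a concrete illustration, the minimal sketch below fits both objectives with scikit-learn and checks the marginal-conservation property at the KL solution; the random data, rank, and iteration counts are illustrative assumptions, not settings from the paper.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((50, 40))  # illustrative nonnegative data matrix

# Frobenius-norm NMF (scikit-learn's default objective).
nmf_fro = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W_fro, H_fro = nmf_fro.fit_transform(X), nmf_fro.components_

# KL-divergence NMF requires the multiplicative-update ("mu") solver.
nmf_kl = NMF(n_components=4, beta_loss="kullback-leibler", solver="mu",
             init="nndsvda", max_iter=2000, random_state=0)
W_kl, H_kl = nmf_kl.fit_transform(X), nmf_kl.components_

# Near stationary points of the KL objective, row and column sums of WH
# approximately match those of X (marginal conservation).
WH = W_kl @ H_kl
print(np.allclose(WH.sum(axis=0), X.sum(axis=0), rtol=1e-2))  # column sums
print(np.allclose(WH.sum(axis=1), X.sum(axis=1), rtol=1e-2))  # row sums
```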
2. Probabilistic Formulation: The Principle of the Common Cause
Interpreting $X$ (normalized so that $\sum_{ij} X_{ij} = 1$) as a joint probability $P(i, j)$, the NMF decomposition takes the form

$$P(i, j) = \sum_{k=1}^{r} P(i \mid k)\, P(j \mid k)\, P(k).$$

This expresses the “independent mixture model,” wherein $k$ indexes a latent “common cause” that statistically “screens off” the dependency between $i$ and $j$:

$$P(i, j \mid k) = P(i \mid k)\, P(j \mid k).$$

Reichenbach’s principle of the common cause precisely corresponds to the existence of an exact NMF at the nonnegative rank $r = \operatorname{rank}_+(X)$.
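A minimal sketch of this reading, assuming the standard rescaling of fitted NMF factors into conditional distributions (the function name and normalization steps are illustrative, not code from the paper):

```python
import numpy as np

def nmf_to_mixture(W, H):
    """Rescale nonnegative factors W (m x r), H (r x n) of a normalized
    joint matrix P ~ W @ H into P(i|k), P(j|k), and P(k)."""
    a = W.sum(axis=0)              # total mass of each basis column
    b = H.sum(axis=1)              # total mass of each coefficient row
    P_i_given_k = W / a            # each column sums to 1: P(i|k)
    P_j_given_k = H / b[:, None]   # each row sums to 1: P(j|k)
    P_k = a * b                    # latent-cause weights; sum is ~1
    return P_i_given_k, P_j_given_k, P_k
```

Since $WH = \sum_k (W_{\cdot k}/a_k)(H_{k\cdot}/b_k)\, a_k b_k$, the rescaling reproduces the factorization exactly while exposing it as Reichenbach's screening-off decomposition.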
3. Predictability and Effective Rank Estimation
Standard model selection procedures (e.g., BIC, RRSSQ) are typically used to determine the effective inner rank of NMF, but they are susceptible to noise and often lack clear optima in practice. By contrast, the PCC-inspired “predictability inequalities” impose, for each pixel–image pair $(i, j)$, a constraint that the fitted factorization must satisfy for the latent common cause to predict the data. The minimal rank $r$ for which no more than a prescribed small fraction of these inequalities is violated is adopted as the estimate of the effective NMF rank. Empirically, this criterion yields a sharp transition, robust to weak noise. Example results for grayscale image matrices:

| Dataset  | Estimated effective rank |
|----------|--------------------------|
| Swimmer  | 14                       |
| Olivetti | 26                       |
| UTKFace  | 30                       |
In contrast, information-theoretic criteria fail to converge to stable ranks under weak noise (Khalafyan et al., 3 Sep 2025).
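Because the exact predictability inequality is specific to the paper, the scaffold below shows only the surrounding selection loop: `violation_fraction` is a hypothetical placeholder for the paper's per-pair criterion, and the 5% threshold is an assumed value.

```python
from sklearn.decomposition import NMF

def violation_fraction(X, W, H):
    # Hypothetical placeholder: should return the fraction of pixel-image
    # pairs (i, j) violating the paper's predictability inequality.
    raise NotImplementedError("substitute the PCC criterion here")

def estimate_effective_rank(X, ranks, threshold=0.05):
    """Smallest rank whose violation fraction drops below `threshold`."""
    for r in ranks:
        model = NMF(n_components=r, beta_loss="kullback-leibler", solver="mu",
                    init="nndsvda", max_iter=1000, random_state=0)
        W, H = model.fit_transform(X), model.components_
        if violation_fraction(X, W, H) <= threshold:
            return r
    return None  # no candidate rank satisfied the criterion
```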
4. Stability, Nonidentifiability, and "Sweet-Spot" of Solutions
NMF presents a fundamental nonidentifiability: multiple distinct decompositions may yield similar or identical loss. A solution is practically useful only if its components (the basis images, i.e., the columns of $W$) are reproducible under noise and random initializations.
Stability is quantified by:
- Splitting the dataset, applying NMF with the same target rank (under independent random initializations), and matching basis images using the Hungarian algorithm on cosine distance:

$$d_{\cos}(w, w') = 1 - \frac{\langle w, w' \rangle}{\|w\|\, \|w'\|}.$$

- Observing that for $r$ near the PCC-estimated effective rank, the set of basis images clusters tightly (small average matched distance), even under 25% pixel noise. For substantially smaller or larger $r$, reproducibility degrades significantly.
This “sweet-spot” near the PCC-estimated effective rank delineates the regime of locally unique, interpretable, and stable solutions (Khalafyan et al., 3 Sep 2025).
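A sketch of this stability check, with two fits on split data standing in for the paper's full protocol (sizes and solver settings are illustrative assumptions):

```python
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.decomposition import NMF

def matched_basis_distance(X1, X2, r):
    """Fit NMF on two dataset halves and return the mean cosine distance
    between optimally matched basis images (small => stable basis)."""
    W1 = NMF(n_components=r, init="nndsvda", max_iter=500,
             random_state=0).fit_transform(X1)
    W2 = NMF(n_components=r, init="nndsvda", max_iter=500,
             random_state=1).fit_transform(X2)
    D = cdist(W1.T, W2.T, metric="cosine")  # pairwise basis distances
    rows, cols = linear_sum_assignment(D)   # Hungarian matching
    return D[rows, cols].mean()
```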
5. Clustering Interpretation via Approximate Mixture Model
In the “soft” PCC implemented by empirical NMF, the relative approximation error

$$\varepsilon_{ij} = \frac{|X_{ij} - (WH)_{ij}|}{X_{ij}}$$

is strongly anticorrelated with both the true marginal probability and the correlatedness in the data. Thus, NMF more faithfully explains large and positively correlated events.
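A diagnostic sketch of this anticorrelation, assuming the relative-error definition above and using Spearman rank correlation as an illustrative statistic (not necessarily the paper's):

```python
import numpy as np
from scipy.stats import spearmanr

def error_probability_correlation(X, WH, eps=1e-12):
    """Spearman correlation between entries' probability mass and their
    relative approximation error; strong anticorrelation gives rho < 0."""
    rel_err = np.abs(X - WH) / (X + eps)  # relative approximation error
    P = X / X.sum()                       # joint-probability reading of X
    rho, _ = spearmanr(P.ravel(), rel_err.ravel())
    return rho
```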
This property motivates a clustering strategy: for each basis index $k$, form the cluster

$$C_k = \{\, j : k = \arg\max_{k'} H_{k'j} \,\},$$

or select the top-$n$ images by the coefficient $H_{kj}$. On face datasets, these clusters select visually and semantically coherent sets (e.g., images belonging to the same individual).
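A minimal sketch of this rule, reading the cluster assignment as a hard argmax over the coefficient matrix $H$ (an assumed but standard construction):

```python
import numpy as np

def nmf_clusters(H):
    """Hard assignment: image j goes to the basis k maximizing H[k, j]."""
    return np.argmax(H, axis=0)

def top_n_images(H, k, n=10):
    """Indices of the n images with the largest coefficient on basis k."""
    return np.argsort(H[k])[::-1][:n]
```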
6. Denoising via NMF and Quantitative Performance
To denoise, one fits NMF to the noisy data $\tilde{X}$ and reconstructs the low-rank approximation $WH$. An image $j$ is considered denoised if its cosine distance to the clean version is reduced:

$$d_{\cos}\big((WH)_{\cdot j},\, X_{\cdot j}\big) < d_{\cos}\big(\tilde{X}_{\cdot j},\, X_{\cdot j}\big).$$

With flip noise of 25% (Swimmer dataset), almost all images are improved across a wide range of ranks $r$. NMF outperforms truncated PCA (60–70% denoising success vs. lower for PCA). On binarized data, PCA may have a slight edge, but both methods are far better than random performance.
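The sketch below reproduces this protocol on synthetic low-rank binary data standing in for Swimmer; the sizes, densities, and target rank are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
W_true = (rng.random((256, 8)) < 0.1).astype(float)    # sparse nonnegative parts
H_true = (rng.random((8, 200)) < 0.5).astype(float)
X = np.clip(W_true @ H_true, 0.0, 1.0)                 # clean binary images (columns)

flips = rng.random(X.shape) < 0.25                     # 25% flip noise
X_noisy = np.abs(X - flips.astype(float))              # flip the selected pixels

model = NMF(n_components=8, init="nndsvda", max_iter=500, random_state=0)
WH = model.fit_transform(X_noisy) @ model.components_  # low-rank reconstruction

cols = [j for j in range(X.shape[1]) if X[:, j].any()] # skip degenerate images
denoised = sum(cosine(WH[:, j], X[:, j]) < cosine(X_noisy[:, j], X[:, j])
               for j in cols)
print(f"denoised {denoised}/{len(cols)} images")
```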
7. Experimental Synthesis and Principal Findings
All results are supported by quantitative benchmarking on three canonical datasets (Swimmer, Olivetti faces, UTKFace) and a suite of metrics:
- Fraction of satisfied PCC predictability inequalities (the rank-selection criterion)
- Mean internal cosine distance of basis images
- BIC1–3 and RRSSQ loss curves
- Clustering consistency
- Denoising percentage and reconstruction accuracy
Principal conclusions:
- PCC-based predictability provides a robust, noise-resistant estimator of effective NMF rank.
- NMF bases at the PCC-estimated effective rank are highly consistent across noise and seed variation, resolving practical nonidentifiability in this regime.
- NMF naturally implements a “soft” mixture model (PCC), prioritizing large and likely correlations.
- NMF-derived clusters map to meaningful semantic groupings in images.
- NMF enables robust denoising, often outperforming PCA, over a broad range of ranks.
These results mathematically connect classical NMF with foundational probabilistic-causal modeling concepts, advancing the methodology for principled rank selection, part stability, clustering, and denoising (Khalafyan et al., 3 Sep 2025).