Deep Semi-NMF: Hierarchical Matrix Factorization
- Deep semi-NMF is a multilayer matrix factorization technique that integrates nonnegativity constraints to uncover interpretable, hierarchical latent representations.
- It employs layer stacking, elementwise nonlinearity, and alternating minimization to optimize representations for tasks like clustering, classification, and collaborative filtering.
- Empirical results demonstrate improved accuracy and robustness over shallow models in applications such as image analysis, recommender systems, and neural network weight estimation.
Deep semi-nonnegative matrix factorization (Deep semi-NMF) generalizes convex and semi-convex matrix factorization to deep or multilayer settings, incorporating hierarchical representation learning under constraints of (partial) non-negativity. Deep semi-NMF enables the discovery of interpretable and discriminative latent structures across multiple abstraction levels in data, with applications in clustering, classification, collaborative filtering, and neural network weight estimation. Layer stacking, elementwise nonlinearity, and alternating optimization define typical methodological frameworks.
1. Formulation and Theoretical Foundations
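Deep semi-NMF extends shallow semi-NMF, which factorizes a real matrix $X \in \mathbb{R}^{p \times n}$ as $X \approx ZH$ with $Z$ unconstrained and $H \ge 0$ nonnegative. The deep (multilayer) formulation is:
$$X \approx Z_1 Z_2 \cdots Z_m H_m,$$
where $Z_1 \in \mathbb{R}^{p \times k_1}$, $Z_2 \in \mathbb{R}^{k_1 \times k_2}$, ..., $Z_m \in \mathbb{R}^{k_{m-1} \times k_m}$, and $H_m \in \mathbb{R}_{\ge 0}^{k_m \times n}$. Intermediate matrices $H_i$ (for $i < m$) are defined recursively: $H_i \approx Z_{i+1} H_{i+1}$. This structure imposes nonnegativity on the final layer's representations, with unconstrained factors otherwise, promoting “soft” cluster assignments at each level (Trigeorgis et al., 2015).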
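As a minimal sketch of the linear multilayer formulation, the snippet below builds a reconstruction from a list of unconstrained factors and a nonnegative top-layer matrix, and recovers the intermediate representations by the recursion. The names `reconstruct`, `intermediate_features`, `Zs`, and `H_top`, as well as the sizes, are illustrative assumptions and do not come from the cited papers.

```python
import numpy as np

def reconstruct(Zs, H_top):
    """Return Z_1 @ Z_2 @ ... @ Z_m @ H_m for a linear deep factorization."""
    out = H_top
    for Z in reversed(Zs):
        out = Z @ out
    return out

def intermediate_features(Zs, H_top):
    """Return [H_1, ..., H_m] using the exact recursion H_i = Z_{i+1} @ H_{i+1}."""
    feats = [H_top]
    for Z in reversed(Zs[1:]):          # Z_m, ..., Z_2
        feats.append(Z @ feats[-1])
    return feats[::-1]

# Hypothetical sizes: p = 20 features, n = 50 samples, widths k_1 = 8, k_2 = 4.
rng = np.random.default_rng(0)
Zs = [rng.standard_normal((20, 8)), rng.standard_normal((8, 4))]  # mixed sign
H_top = rng.random((4, 50))                                       # nonnegative
X_hat = reconstruct(Zs, H_top)
H1, H2 = intermediate_features(Zs, H_top)
```

In this exact recursion only the top-layer matrix is guaranteed nonnegative; an intermediate `H1` computed as a product with a mixed-sign factor can itself be mixed-sign unless it is constrained during training.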
Nonlinearity can be interleaved: $H_i \approx g(Z_{i+1} H_{i+1})$, where $g$, e.g. ReLU or softplus, is applied elementwise (Krishna et al., 2017).
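A small sketch of the interleaved nonlinearity follows, assuming the elementwise function is applied to each product of a factor matrix with the representation above it. The helper names and matrix sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softplus(a):
    # log(1 + exp(a)), computed stably
    return np.logaddexp(0.0, a)

def nonlinear_features(Zs, H_top, g=relu):
    """Return [H_1, ..., H_m] with H_i = g(Z_{i+1} @ H_{i+1}), top layer first."""
    feats = [H_top]
    for Z in reversed(Zs[1:]):          # Z_m, ..., Z_2
        feats.append(g(Z @ feats[-1]))
    return feats[::-1]

rng = np.random.default_rng(1)
Zs = [rng.standard_normal((20, 8)), rng.standard_normal((8, 4))]
H_top = rng.random((4, 50))
feats = nonlinear_features(Zs, H_top, g=relu)
X_hat = Zs[0] @ feats[0]                # the outermost factor stays linear
```

With ReLU or softplus as `g`, every intermediate representation is nonnegative by construction, which is one reason these activations pair naturally with the nonnegativity constraint.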
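In supervised deep semi-NMF for deep neural networks, the model augments this decomposition with inputs $X$, targets $Y$, and layer weights $W_i$: $$\min_{W_i,\, H_i} \; \|Y - W_d H_{d-1}\|_F^2 + \sum_{i=1}^{d-1} \|H_i - f(W_i H_{i-1})\|_F^2, \qquad H_0 = X,$$ with $H_i \ge 0$, $f$ typically ReLU. The first term is a semi-NMF, subsequent summands are nonlinear semi-NMFs (Sakurai et al., 2016).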
2. Optimization Algorithms and Layerwise Procedures
Alternating minimization (“block-coordinate descent”) is fundamental. Greedy layerwise pre-training initializes each pair $(Z_i, H_i)$ by sequential (semi-)NMF or NMF on the representations $H_{i-1}$ of the previous layer (Trigeorgis et al., 2015). Fine-tuning proceeds by alternating updates over all factor matrices:
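For the linear case, $Z_i$ is solvable in closed form: $$Z_i = \Psi^{\dagger} X \tilde{H}_i^{\dagger},$$ where $\Psi = Z_1 \cdots Z_{i-1}$, $\tilde{H}_i = Z_{i+1} \cdots Z_m H_m$ is the partial right product, and $\dagger$ denotes the Moore–Penrose pseudoinverse.
$H_i$ update (multiplicative for linear case): $$H_i \leftarrow H_i \odot \sqrt{\frac{[\Phi_i^{\top} X]^{+} + [\Phi_i^{\top} \Phi_i]^{-} H_i}{[\Phi_i^{\top} X]^{-} + [\Phi_i^{\top} \Phi_i]^{+} H_i}}, \qquad \Phi_i = Z_1 \cdots Z_i,$$
with $[A]^{+} = (|A| + A)/2$ and $[A]^{-} = (|A| - A)/2$, where the square root and the division are taken elementwise (Trigeorgis et al., 2015).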
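The alternating fine-tuning updates for the linear case can be sketched as follows: a pseudoinverse-based closed-form step for each unconstrained factor, then a multiplicative step for the matching nonnegative representation. This is a hedged illustration, not the authors' reference implementation; the function names, the small constant `eps`, the pseudoinverse cutoff, and the random initialization are all assumptions.

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2

def neg(A):
    return (np.abs(A) - A) / 2

def update_H(Phi, X, H, eps=1e-12):
    """Multiplicative semi-NMF update of H >= 0 for X ~ Phi @ H."""
    PtX, PtP = Phi.T @ X, Phi.T @ Phi
    num = pos(PtX) + neg(PtP) @ H
    den = neg(PtX) + pos(PtP) @ H
    return H * np.sqrt(num / (den + eps))   # eps guards against division by zero

def finetune_sweep(X, Zs, Hs, cutoff=1e-10):
    """One pass of alternating updates over all layers (lists updated in place)."""
    for i in range(len(Zs)):
        H_tilde = Hs[-1]                    # partial right product Z_{i+1}...Z_m H_m
        for Z in reversed(Zs[i + 1:]):
            H_tilde = Z @ H_tilde
        Psi = np.eye(X.shape[0])            # partial left product Z_1...Z_{i-1}
        for Z in Zs[:i]:
            Psi = Psi @ Z
        # Second positional argument is the singular-value cutoff of pinv.
        Zs[i] = np.linalg.pinv(Psi, cutoff) @ X @ np.linalg.pinv(H_tilde, cutoff)
        Hs[i] = update_H(Psi @ Zs[i], X, Hs[i])
    return Zs, Hs

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 30))           # mixed-sign data
Zs = [rng.standard_normal((12, 6)), rng.standard_normal((6, 3))]
Hs = [rng.random((6, 30)), rng.random((3, 30))]
err_before = np.linalg.norm(X - Zs[0] @ Zs[1] @ Hs[1])
for _ in range(20):
    Zs, Hs = finetune_sweep(X, Zs, Hs)
err_after = np.linalg.norm(X - Zs[0] @ Zs[1] @ Hs[1])
```

Because the multiplicative step only rescales entries by nonnegative ratios, a nonnegative initialization stays nonnegative without any explicit projection.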
For nonlinear objectives or activation functions, projected/stationary gradient iterations are used, maintaining $H_i \ge 0$. In supervised deep semi-NMF, stationary iteration updates align model weights with the nonlinear layerwise prediction (e.g., $H_i \approx f(W_i H_{i-1})$ for layer weights $W_i$ and activation $f$) and projection ensures nonnegativity of $H_i$ (Sakurai et al., 2016).
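To illustrate the projection idea in the nonlinear case, here is a generic projected gradient step for a single layer with a ReLU activation. It is a sketch under stated assumptions rather than the stationary iteration of the cited work: the objective, the step size `lr`, and all names are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def projected_gradient_step(H, Z, target, lr=1e-2):
    """One step on 0.5 * ||target - relu(Z @ H)||_F^2 subject to H >= 0."""
    pre = Z @ H
    residual = relu(pre) - target
    grad = Z.T @ (residual * (pre > 0))     # chain rule through ReLU (subgradient)
    return np.maximum(H - lr * grad, 0.0)   # project back onto the nonnegative orthant

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 3))
target = relu(Z @ rng.random((3, 10)))      # synthetic nonnegative layer output
H = rng.random((3, 10))
for _ in range(200):
    H = projected_gradient_step(H, Z, target)
loss = 0.5 * np.linalg.norm(target - relu(Z @ H)) ** 2
```

The clipping in the last line of the step is what keeps the representation feasible; the gradient itself is free to point outside the nonnegative orthant.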
Autoencoder-style (pre-)training is often used for initialization: standard NMF on each layer's output with subsequent nonlinear encoder fit (Sakurai et al., 2016).
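Greedy layerwise initialization can be sketched as repeated shallow semi-NMF, where each layer factorizes the nonnegative representation produced by the previous one. The code below is a minimal illustration; the random initialization, iteration count, and names `semi_nmf` and `pretrain` are assumptions, and the cited works may use other initializers.

```python
import numpy as np

def semi_nmf(X, k, iters=100, seed=0, eps=1e-12):
    """Shallow semi-NMF X ~ Z @ H with H >= 0, via alternating updates."""
    rng = np.random.default_rng(seed)
    H = rng.random((k, X.shape[1])) + 0.1   # strictly positive start
    for _ in range(iters):
        Z = X @ np.linalg.pinv(H)           # least-squares step for Z
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        num = (np.abs(ZtX) + ZtX) / 2 + ((np.abs(ZtZ) - ZtZ) / 2) @ H
        den = (np.abs(ZtX) - ZtX) / 2 + ((np.abs(ZtZ) + ZtZ) / 2) @ H
        H = H * np.sqrt(num / (den + eps))  # multiplicative step keeps H >= 0
    return Z, H

def pretrain(X, widths, **kwargs):
    """Factorize X, then each resulting H in turn, one layer per width."""
    Zs, Hs, current = [], [], X
    for k in widths:
        Z, H = semi_nmf(current, k, **kwargs)
        Zs.append(Z)
        Hs.append(H)
        current = H                         # the next layer factorizes this H
    return Zs, Hs

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 40))
Zs, Hs = pretrain(X, widths=[6, 3], iters=50)
layer1_error = np.linalg.norm(X - Zs[0] @ Hs[0])
```

The factors returned here are only a starting point; joint fine-tuning over all layers is what ties them to the full reconstruction error.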
3. Model Variants and Nonlinearity
Nonlinear deep semi-NMF structures insert elementwise nonlinearities between factorizing linear layers. In collaborative filtering, deep semi-NMF can apply nonlinearity only on the item side, so that the user factor enters linearly while the item representation is produced by the nonlinear layers. Yet the “interaction” remains linear in the topmost non-negative features. Stacking more than two nonlinear layers has been empirically shown to increase test RMSE, indicating diminishing returns for depth (Krishna et al., 2017).
In deep neural network weight estimation, the nonlinear semi-NMF paradigm is leveraged for end-to-end layer-wise weight learning without explicit gradient backpropagation, using alternating minimization for both semi-NMF and nonlinear semi-NMF objectives (Sakurai et al., 2016).
The semi-supervised extension incorporates attribute information via Laplacian regularizers, enabling label propagation and enforcing smoothness within each $H_i$ based on partially known attributes (“Deep WSF”) (Trigeorgis et al., 2015).
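A Laplacian regularizer of this kind can be sketched with a graph built from partially known labels. The construction below (unit edges between samples sharing a known label, `-1` marking an unknown label) is an illustrative assumption, not necessarily the graph used in the cited work.

```python
import numpy as np

def label_laplacian(labels):
    """Adjacency W and Laplacian L = D - W linking samples with equal known labels."""
    labels = np.asarray(labels)
    known = labels >= 0                      # -1 marks an unknown label (assumption)
    W = (labels[:, None] == labels[None, :]) & known[:, None] & known[None, :]
    W = W.astype(float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    return W, L

def smoothness_penalty(H, L):
    """tr(H L H^T): small when linked samples have similar columns of H."""
    return float(np.trace(H @ L @ H.T))

labels = [0, 0, 1, -1, 1]
W, L = label_laplacian(labels)
rng = np.random.default_rng(0)
H = rng.random((3, 5))                       # k = 3 features for n = 5 samples
penalty = smoothness_penalty(H, L)

# Equivalent pairwise form: 0.5 * sum_jk W_jk * ||h_j - h_k||^2
diffs = H[:, :, None] - H[:, None, :]
pairwise = 0.5 * float((W * (diffs ** 2).sum(axis=0)).sum())
```

The pairwise form makes the effect explicit: the penalty pulls together the representations of samples known to share an attribute, and leaves unlabeled samples unconstrained.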
4. Empirical Performance and Applications
Clustering and Classification
Deep semi-NMF demonstrates strong clustering accuracy (AC), normalized mutual information (NMI), and SVM-based classification accuracy, consistently outperforming one-layer NMF/Semi-NMF, graph-regularized NMFs, and classical multi-layer NMF on standard face datasets (CMU PIE, XM2VTS, Multi-PIE). The gains (e.g., on CMU PIE: Deep Semi-NMF ~54% vs Semi-NMF ~50% AC) are confirmed across both pixel and image-gradient features (Trigeorgis et al., 2015).
Deep WSF further improves clustering/classification via multi-attribute Laplacian regularization, achieving significant advances in attribute-specific accuracies (e.g., pose, expression, identity) (Trigeorgis et al., 2015).
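Clustering accuracy (AC) is usually computed after matching predicted cluster ids to ground-truth labels, since cluster ids are arbitrary. The sketch below uses a brute-force search over label assignments, which is only practical for a handful of clusters; larger problems normally use the Hungarian algorithm instead. The function name and the toy labels are illustrative.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy over one-to-one mappings of cluster ids to labels."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    if len(pred_ids) > len(true_ids):
        raise ValueError("more predicted clusters than true classes")
    best_hits = 0
    for assigned in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, assigned))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

# Toy example: cluster ids 0 and 1 are swapped, and one sample is misassigned.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 0]
acc = clustering_accuracy(true_labels, pred_labels)   # 5 of 6 samples matched
```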
Deep Neural Networks
Deep semi-NMF exhibits supervised performance within 0.1–0.2% of backpropagation-trained neural networks for MNIST and CIFAR-10 classification, with convergence times per epoch comparable (within 5–10%) to standard BP frameworks and improved robustness to hyperparameter (e.g., step-size) settings (Sakurai et al., 2016).
Collaborative Filtering
In recommender systems, nonlinear deep semi-NMF attains lower RMSE than both shallow NMF and more complex deep matrix factorization approaches. For FilmTrust, MovieLens 100K, and Amazon Music, NSNMF (ReLU + bias) achieved the lowest RMSE in each case, e.g., 0.788 (FilmTrust), 0.887 (MovieLens 100K), and 0.836 (Amazon Music). NSNMF's item clustering quality (WCSS) is on par with deep MF models (Krishna et al., 2017).
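RMSE in rating prediction is computed over observed entries only, since most of the user-item matrix is missing. A minimal sketch, with a hypothetical boolean `observed` mask and toy values that are not taken from any of the datasets above:

```python
import numpy as np

def masked_rmse(ratings, predictions, observed):
    """Root mean squared error restricted to entries where observed is True."""
    ratings = np.asarray(ratings, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    errors = (ratings - predictions)[observed]
    return float(np.sqrt(np.mean(errors ** 2)))

ratings = [[5, 0], [3, 4]]                    # 0 stands for a missing rating here
observed = [[True, False], [True, True]]
predictions = [[4, 2], [3, 2]]
rmse = masked_rmse(ratings, predictions, observed)   # sqrt((1 + 0 + 4) / 3)
```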
5. Interpretability, Hierarchical Structure, and Attribute Discovery
Each $H_i$ in deep semi-NMF corresponds to a non-negative, interpretable latent representation, akin to soft clustering, at abstraction level $i$. This hierarchical modeling enables unsupervised discovery of data attributes such as pose, expression, or identity in face datasets, with the hierarchy empirically matching “known” attributes when available (Trigeorgis et al., 2015). Layer stacking combined with nonnegativity endows final features with increased discriminativeness and interpretability relative to classical deep autoencoders, where both sides of the factorization may be unconstrained (Krishna et al., 2017).
In collaborative filtering, the restriction to item-side depth (while user features remain linear) preserves identifiability of item factors and reduces overfitting risk, with interpretability retained in top-layer nonnegative representations (Krishna et al., 2017).
6. Methodological Comparisons and Limitations
Relative to deep MF models with unconstrained representations (e.g., deep autoencoders, multilayer perceptrons), deep semi-NMF restricts nonnegativity to specific matrix blocks, maintaining linearity in the outer product and allowing “part-based” or clustering interpretation at the final layer or after each intermediate step. Deep semi-NMF differs from deep NMF approaches by its ability to handle mixed-sign input data and retain soft cluster-weights interpretation.
The effectiveness of deep semi-NMF is contingent upon suitable selection of depth ($m$) and layer widths $k_1, \dots, k_m$, which remain heuristic. Training complexity increases with layer count, and stacking beyond two layers provides diminishing or negative returns in standard tasks (notably in collaborative filtering) (Krishna et al., 2017). Layerwise greedy pretraining is critical in navigating the highly nonconvex loss landscape; without it, final representations may underperform. Laplacian regularization for semi-supervised deep semi-NMF improves attribute-disentanglement but introduces additional hyperparameters (Trigeorgis et al., 2015).
7. Future Directions
Extensions under investigation include multilinear/tensor generalizations, application to heterogeneous modalities (e.g., speech), and the development of more advanced nonconvex optimization algorithms to manage the scalability and depth of deep semi-NMF architectures. Further analysis of nonnegativity's inductive bias at deep layers, as well as its utility in domains beyond vision and recommendation, presents an open research direction (Trigeorgis et al., 2015, Sakurai et al., 2016, Krishna et al., 2017).