Deep Semi-NMF: Hierarchical Matrix Factorization
- Deep Semi-NMF is a hierarchical multi-layer matrix factorization framework that decomposes high-dimensional data into interpretable, non-negative soft cluster memberships.
- It extends traditional Semi-NMF by stacking linear transformations to uncover nested feature hierarchies and reveal complex, overlapping attributes.
- The method employs greedy layer-wise pretraining followed by joint fine-tuning with multiplicative update rules to improve clustering accuracy and classification performance.
Deep Semi-Non-negative Matrix Factorization (Deep Semi-NMF) is a hierarchical matrix factorization framework designed to recover interpretable, multi-level attribute representations from high-dimensional data. Extending the concept of Semi-NMF to a deep, multi-layer architecture, it models the generative structure of data as a product of stacked linear transformations culminating in non-negative latent factors. Unlike classical flat matrix factorization, Deep Semi-NMF captures hierarchies of attributes, with each layer producing a non-negative feature matrix interpreted as soft cluster memberships for latent factors. This approach allows the uncovering of complex, nested structure in datasets, particularly when clustering or class labels reflect multiple, overlapping factors of variation (Trigeorgis et al., 2015).
1. Model Architecture
Given a data matrix $X \in \mathbb{R}^{p \times n}$, Deep Semi-NMF factorizes it as:

$$X \approx Z_1 Z_2 \cdots Z_m H_m$$

where the $Z_i$ (with $Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$, $k_0 = p$) are stacked “basis” matrices with mixed signs and $H_m \geq 0$ is a non-negative feature (“attribute”) matrix. Intermediate layers introduce feature matrices $H_1, \dots, H_{m-1}$, each non-negative, yielding the layer-wise structure:
\begin{align*}
X &\approx Z_1 H_1 \\
H_1 &\approx Z_2 H_2 \\
&\;\;\vdots \\
H_{m-1} &\approx Z_m H_m
\end{align*}
Each $H_i$ is interpreted as a soft cluster-membership matrix for $k_i$ latent attributes (Trigeorgis et al., 2015).
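As a concrete illustration of the stacked shapes, the following minimal NumPy sketch (with arbitrary, illustrative dimensions not taken from the paper) builds random factors of the appropriate sizes and checks that their product reconstructs a matrix of the same shape as $X$.

```python
import numpy as np

# Illustrative dimensions (not from the paper): p features, n samples,
# and two layers of sizes k1 > k2.
p, n, k1, k2 = 100, 500, 40, 10

rng = np.random.default_rng(0)
X = rng.standard_normal((p, n))          # mixed-sign data matrix

Z1 = rng.standard_normal((p, k1))        # mixed-sign basis, layer 1
Z2 = rng.standard_normal((k1, k2))       # mixed-sign basis, layer 2
H2 = rng.random((k2, n))                 # non-negative attribute matrix

X_hat = Z1 @ Z2 @ H2                     # X ~ Z1 Z2 H2
H1_implied = Z2 @ H2                     # intermediate features: H1 ~ Z2 H2

assert X_hat.shape == X.shape
assert H1_implied.shape == (k1, n)
```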
2. Objective Functions and Constraints
For the unsupervised (purely linear) form, the Deep Semi-NMF objective is:

$$C_{\text{deep}} = \left\lVert X - Z_1 Z_2 \cdots Z_m H_m \right\rVert_F^2, \qquad \text{s.t. } H_m \geq 0.$$

Optionally, non-negativity constraints can be enforced on all intermediate matrices $H_1, \dots, H_{m-1}$ as well.
A trace-based reformulation is also valid:

$$C_{\text{deep}} = \mathrm{tr}\!\left[ X^\top X - 2\, X^\top Z_1 \cdots Z_m H_m + H_m^\top (Z_1 \cdots Z_m)^\top Z_1 \cdots Z_m H_m \right].$$

When partial attribute labels are available, the semi-supervised extension, Deep WSF, augments the objective with graph Laplacian regularizers:

$$C_{\text{WSF}} = \left\lVert X - Z_1 \cdots Z_m H_m \right\rVert_F^2 + \sum_{i=1}^{m} \lambda_i\, \mathrm{tr}\!\left( H_i L_i H_i^\top \right),$$

where each $L_i$ is the Laplacian matrix built from available class labels for layer $i$, and $\lambda_i$ is a tuning hyperparameter (Trigeorgis et al., 2015).
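For concreteness, here is a minimal sketch of how these two objectives can be evaluated with NumPy, assuming the factors `Zs`, `Hs` and the per-layer Laplacians `Ls` are supplied as arrays; the function names are illustrative.

```python
import numpy as np
from functools import reduce

def deep_seminmf_objective(X, Zs, H_m):
    """Unsupervised Deep Semi-NMF cost: ||X - Z1 Z2 ... Zm Hm||_F^2."""
    recon = reduce(np.matmul, Zs) @ H_m
    return np.linalg.norm(X - recon, "fro") ** 2

def deep_wsf_objective(X, Zs, Hs, Ls, lambdas):
    """Deep WSF cost: reconstruction error plus per-layer graph penalties
    sum_i lambda_i * tr(H_i L_i H_i^T). Hs[-1] is H_m."""
    cost = deep_seminmf_objective(X, Zs, Hs[-1])
    for H_i, L_i, lam_i in zip(Hs, Ls, lambdas):
        cost += lam_i * np.trace(H_i @ L_i @ H_i.T)
    return cost
```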
3. Optimization Methods
Deep Semi-NMF utilizes a two-phase optimization scheme:
A. Greedy Layer-wise Pre-training:
Each layer solves a two-factor Semi-NMF on the output of the previous layer:
- For $i = 1$ to $m$, factor $H_{i-1} \approx Z_i H_i$ (with $H_0 = X$ and $H_i \geq 0$).
- Alternate updates:
  - $Z_i \leftarrow H_{i-1} H_i^{\dagger}$, where $\dagger$ denotes the Moore–Penrose inverse.
  - $H_i$ is updated multiplicatively to ensure $H_i \geq 0$:
$$H_i \leftarrow H_i \odot \sqrt{\frac{[Z_i^\top H_{i-1}]^{+} + [Z_i^\top Z_i]^{-} H_i}{[Z_i^\top H_{i-1}]^{-} + [Z_i^\top Z_i]^{+} H_i}},$$
with $[A]^{+} = \tfrac{1}{2}(|A| + A)$ and $[A]^{-} = \tfrac{1}{2}(|A| - A)$ taken elementwise.
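A minimal NumPy sketch of these two alternating updates for a single layer is given below; the small `eps` guard and the random non-negative initialization are numerical conveniences assumed here, not prescriptions from the paper.

```python
import numpy as np

def _pos(A):
    """Elementwise positive part: (|A| + A) / 2."""
    return (np.abs(A) + A) / 2.0

def _neg(A):
    """Elementwise negative part: (|A| - A) / 2."""
    return (np.abs(A) - A) / 2.0

def semi_nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Two-factor Semi-NMF: A (p x n) ~ Z (p x k, mixed sign) @ H (k x n, H >= 0)."""
    rng = np.random.default_rng(seed)
    H = rng.random((k, A.shape[1])) + eps          # non-negative initialization
    for _ in range(n_iter):
        Z = A @ np.linalg.pinv(H)                  # closed-form Z update (Moore-Penrose)
        ZtA, ZtZ = Z.T @ A, Z.T @ Z
        num = _pos(ZtA) + _neg(ZtZ) @ H            # numerator of the multiplicative rule
        den = _neg(ZtA) + _pos(ZtZ) @ H + eps      # denominator (guarded against zeros)
        H *= np.sqrt(num / den)                    # preserves non-negativity of H
    return Z, H
```

Layer-wise pretraining then amounts to calling `semi_nmf` on $X$ to obtain $(Z_1, H_1)$, then on $H_1$ to obtain $(Z_2, H_2)$, and so on.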
B. Joint Fine-Tuning:
After pre-training, all factors are updated via alternating minimization:
For each layer $i$, define $\Psi = Z_1 \cdots Z_{i-1}$ (the identity for $i = 1$) and $\tilde{H}_i$ (equal to $H_m$ if $i = m$, else $Z_{i+1} \cdots Z_m H_m$).
- $Z_i$-update (closed-form least squares):
$$Z_i \leftarrow \Psi^{\dagger} X \tilde{H}_i^{\dagger}$$
- $H_i$-update (multiplicative, preserves non-negativity), writing $\Phi = \Psi Z_i = Z_1 \cdots Z_i$:
$$H_i \leftarrow H_i \odot \sqrt{\frac{[\Phi^\top X]^{+} + [\Phi^\top \Phi]^{-} H_i}{[\Phi^\top X]^{-} + [\Phi^\top \Phi]^{+} H_i}}$$
- Repeat until convergence (Trigeorgis et al., 2015).
For Deep WSF, the $H_i$-update includes $\lambda_i$-weighted Laplacian regularization terms in the numerator and denominator, leveraging partial supervision.
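The sketch below performs one such fine-tuning sweep over all layers, assuming lists `Zs` and `Hs` produced by the layer-wise pretraining sketch above; the Deep WSF variant would add the Laplacian terms described in Section 4.

```python
import numpy as np
from functools import reduce

def _pos(A): return (np.abs(A) + A) / 2.0
def _neg(A): return (np.abs(A) - A) / 2.0

def finetune_sweep(X, Zs, Hs, eps=1e-9):
    """One alternating pass of joint fine-tuning; Zs[i], Hs[i] hold layer i+1's factors."""
    m = len(Zs)
    for i in range(m):
        # Psi = Z_1 ... Z_{i-1} (identity for the first layer).
        Psi = reduce(np.matmul, Zs[:i], np.eye(X.shape[0]))
        # H_tilde_i = Z_{i+1} ... Z_m H_m (reduces to H_m for the last layer).
        H_tilde = reduce(np.matmul, Zs[i + 1:], np.eye(Hs[i].shape[0])) @ Hs[-1]
        # Closed-form least-squares update of Z_i.
        Zs[i] = np.linalg.pinv(Psi) @ X @ np.linalg.pinv(H_tilde)
        # Multiplicative update of H_i using Phi = Z_1 ... Z_i.
        Phi = Psi @ Zs[i]
        PtX, PtP = Phi.T @ X, Phi.T @ Phi
        num = _pos(PtX) + _neg(PtP) @ Hs[i]
        den = _neg(PtX) + _pos(PtP) @ Hs[i] + eps
        Hs[i] = Hs[i] * np.sqrt(num / den)
    return Zs, Hs
```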
4. Semi-Supervised Extension: Deep WSF
When partial attribute-label supervision is present, Deep WSF (Deep Weakly Supervised Factorization) incorporates a graph-based smoothness term into each layer's factorization. For samples with known class memberships at layer $i$, a similarity graph $W_i$ is constructed and its Laplacian $L_i = D_i - W_i$ (with diagonal degree matrix $D_i$) added to the loss as

$$\lambda_i\, \mathrm{tr}\!\left( H_i L_i H_i^\top \right).$$

The resulting optimization uses the same multiplicative update rule as in Deep Semi-NMF, but adds $\lambda_i H_i W_i$ in the numerator and $\lambda_i H_i D_i$ in the denominator, promoting smoother, label-consistent attribute representations. The pretraining step for each layer is switched from standard Semi-NMF to WSF to integrate available label information from the outset (Trigeorgis et al., 2015).
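A sketch of the correspondingly modified $H$-update is shown below; the placement of the graph terms follows the standard splitting of the Laplacian gradient ($H L = H D - H W$) and should be read as an assumption consistent with the text, not a verbatim transcription of the paper's update.

```python
import numpy as np

def _pos(A): return (np.abs(A) + A) / 2.0
def _neg(A): return (np.abs(A) - A) / 2.0

def wsf_h_update(X, Phi, H, W, D, lam, eps=1e-9):
    """Multiplicative H-update with graph penalty lam * tr(H L H^T), L = D - W.
    Phi is the stacked basis down to this layer, Phi = Z_1 ... Z_i."""
    PtX, PtP = Phi.T @ X, Phi.T @ Phi
    num = _pos(PtX) + _neg(PtP) @ H + lam * (H @ W)        # graph attraction term
    den = _neg(PtX) + _pos(PtP) @ H + lam * (H @ D) + eps  # graph degree term
    return H * np.sqrt(num / den)
```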
5. Algorithmic Summary and Computational Complexity
The training process is as follows:
Layer-wise Pre-training:
- Initialize $Z_i$ and $H_i$ for each layer (e.g., SVD-based initialization).
- For $i = 1$ to $m$, run Semi-NMF (or WSF if supervised) on $H_{i-1}$ (with $H_0 = X$).
- Persist the factors $Z_i$, $H_i$.

Joint Fine-Tuning:
- Alternate the $Z_i$ and $H_i$ updates for each layer until convergence.

Per-iteration computational cost is dominated by the Moore–Penrose pseudoinverses and multiplicative updates performed at each of the $m$ layers (Trigeorgis et al., 2015).
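Putting the two phases together, and assuming the `semi_nmf` and `finetune_sweep` sketches from Section 3 are in scope, a minimal end-to-end training driver (with arbitrary, illustrative layer sizes) could look like this:

```python
import numpy as np

def deep_semi_nmf(X, layer_sizes, n_finetune=100):
    """Greedy layer-wise pretraining followed by joint fine-tuning sweeps."""
    Zs, Hs, A = [], [], X
    for k in layer_sizes:                  # pretraining: factor H_{i-1} ~ Z_i H_i
        Z, H = semi_nmf(A, k)
        Zs.append(Z)
        Hs.append(H)
        A = H                              # the next layer factors the current features
    for _ in range(n_finetune):            # joint fine-tuning of all factors
        Zs, Hs = finetune_sweep(X, Zs, Hs)
    return Zs, Hs

# Toy usage with decreasing layer sizes (coarse-to-fine attribute granularity).
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((100, 400))
Zs, Hs = deep_semi_nmf(X_toy, layer_sizes=(60, 30, 10))
```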
6. Empirical Results and Benchmarks
Deep Semi-NMF and Deep WSF have been validated on several standard face datasets:
| Dataset | Samples | Subjects | Variation per subject |
|---|---|---|---|
| XM2VTS | 2,360 | 295 | 8 images/subject |
| CMU PIE | 2,856 | 68 | 42 illuminations/poses |
| CMU Multi-PIE subset | 7,905 | 147 | 5 poses, 6 expressions |
Input features included raw pixels (all non-negative) and image-gradient–orientation (IGO) descriptors (mixed-sign). Baselines comprised NMF, Semi-NMF, GNMF, Multi-layer NMF, NeNMF, WSF, DNMF, and CNMF.
Performance metrics:
- Clustering: accuracy (AC), normalized mutual information (NMI), AUC of precision-recall.
- Downstream classification: accuracy of a linear SVM trained on the learned features $H_m$ (see the evaluation sketch below).
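For concreteness, a small scikit-learn sketch of how such metrics can be computed from learned features is shown below; it is illustrative only and not the authors' evaluation code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, normalized_mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def evaluate_features(H_m, labels, n_clusters):
    """Cluster and classify samples using the columns of H_m as feature vectors."""
    feats = H_m.T                                           # one row per sample
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
    nmi = normalized_mutual_info_score(labels, pred)        # clustering quality
    X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.3, random_state=0)
    acc = accuracy_score(y_te, LinearSVC().fit(X_tr, y_tr).predict(X_te))
    return nmi, acc
```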
Notable empirical findings:
- Two-layer Deep Semi-NMF outperformed all single-layer and multi-layer NMF baselines in clustering, with gains of up to 15% in accuracy (AC).
- With mixed-sign IGO features, Deep Semi-NMF achieved better class separation than single-layer Semi-NMF.
- Supervised pretraining (using Deep WSF trained on XM2VTS to initialize the model on CMU PIE) improved clustering accuracy by 5–8%.
- On CMU Multi-PIE, Deep WSF's per-layer attribute representations best matched the corresponding ground-truth factors, with pose, expression, and identity each captured most accurately at a different layer (Trigeorgis et al., 2015).
7. Hierarchical Attribute Representation and Interpretability
Each non-negative matrix $H_i$ in the deep hierarchy can be interpreted as a soft clustering over latent factors, corresponding to different attributes in the data. In multi-attribute face datasets, empirical assessment shows:
- Layer 1 (largest layer size $k_1$): broad separation (e.g., head-pose clusters).
- Layer 2 (intermediate $k_2$): refinement into expression groups.
- Layer 3 (smallest $k_3$): subject-identity clusters.
Columns of each $Z_i$ represent “basis portraits” or latent prototypes, with the rows of $H_i$ indicating degrees of membership. Visualizing the $H_i$ across layers reveals a staged “peeling away” of data variability: initial layers partition by high-variance attributes (e.g., pose), while later layers resolve lower-variance ones (e.g., identity). This layered decomposition underwrites the method's capacity to learn disentangled, attribute-aware representations (Trigeorgis et al., 2015).
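As an illustration of this interpretation, the following sketch (assuming factor lists `Zs`, `Hs` such as those produced by the training sketches above) extracts a hard attribute assignment per layer from the soft memberships and maps each layer's basis back to input space so its columns can be viewed as prototypes.

```python
import numpy as np
from functools import reduce

def layer_assignments(Hs):
    """Hard attribute label per sample and layer: argmax over the soft
    membership values in each column of H_i."""
    return [np.argmax(H, axis=0) for H in Hs]

def basis_portraits(Zs, layer):
    """Map layer `layer`'s (0-indexed) basis to input space as Z_1 ... Z_{layer+1};
    for pixel inputs, each column can be reshaped to image dimensions and displayed."""
    return reduce(np.matmul, Zs[:layer + 1])
```

In the face experiments described above, the first layer's columns would then correspond to coarse (e.g., pose-level) prototypes, with deeper layers yielding increasingly fine-grained ones.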