Deep WSF: Weakly-Supervised Deep Semi-NMF
- The paper demonstrates that integrating weak supervision into deep semi-NMF frameworks enhances representation learning and improves clustering and classification accuracy.
- The methodology involves multi-layer factorization with graph Laplacian regularization, enabling hierarchical learning and encoding of partial labels.
- Empirical results show significant gains over standard Semi-NMF, validating its effectiveness on multi-attribute datasets such as face-recognition benchmarks.
Weakly-Supervised Deep Semi-NMF (Deep WSF) is a multi-layer matrix factorization framework designed to learn hierarchical and attribute-specific representations from partially labeled data. This paradigm builds on the classical Semi-Nonnegative Matrix Factorization (Semi-NMF) and incorporates partial prior information using weak supervision, extending factorization into multiple nonnegative layers. Deep WSF simultaneously enables unsupervised learning of hidden semantic features and explicit encoding of available label information, supporting both clustering and classification across complex, multi-attribute datasets (Trigeorgis et al., 2015).
1. Model Formulation and Layered Factorization
At its core, Deep WSF operates on a data matrix $X \in \mathbb{R}^{p \times n}$, where $p$ is the feature dimension and $n$ is the number of samples. The factorization proceeds through $m$ layers:
- $Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$: linear transformation (basis) matrices at layer $i$, with $k_0 = p$ and $k_m$ the dimension of the final code.
- $H_i \in \mathbb{R}^{k_i \times n}$: activation (code) matrices; a nonnegativity constraint $H_i \ge 0$ is imposed at all layers.
Each intermediate layer reconstructs its input as $H_{i-1} \approx Z_i H_i$, giving a hierarchical structure analogous to deep networks. Optional entrywise nonlinearities $g(\cdot)$ can be inserted, making $H_{i-1} \approx g(Z_i H_i)$, but the principal model is linear in the cited work.
For weak supervision, Deep WSF is equipped to integrate partial label information on one or more attributes at each layer, modeled via graph Laplacian regularization (Trigeorgis et al., 2015).
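As a minimal illustration of the layered structure, the sketch below builds a two-layer linear factorization in NumPy and multiplies the factors back together. The names (`Zs` for the basis matrices, `H_m` for the final nonnegative code) and all sizes are illustrative assumptions, not values from the cited work.

```python
import numpy as np

def reconstruct(Zs, H_m):
    """Multiply the factors back together: Z_1 @ Z_2 @ ... @ Z_m @ H_m."""
    out = H_m
    for Z in reversed(Zs):  # apply the deepest basis first
        out = Z @ out
    return out

rng = np.random.default_rng(0)
p, n = 20, 50            # feature dimension, number of samples (hypothetical)
sizes = [p, 10, 4]       # k_0 = p, then hidden sizes k_1, k_2 (hypothetical)

# Basis matrices may have mixed signs (the "semi" in Semi-NMF) ...
Zs = [rng.standard_normal((sizes[i], sizes[i + 1])) for i in range(len(sizes) - 1)]
# ... while the code matrix is constrained to be nonnegative.
H_m = rng.random((sizes[-1], n))

X_hat = reconstruct(Zs, H_m)
print(X_hat.shape)  # (20, 50)
```

Only the codes carry the nonnegativity constraint here; the data and the bases are left unconstrained in sign.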
2. Objective Function and Regularization
The full optimization criterion is
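$$\min_{\{Z_i\},\,\{H_i\}} \; \frac{1}{2}\left\| X - Z_1 Z_2 \cdots Z_m H_m \right\|_F^2 + \frac{1}{2}\sum_{i=1}^{m} \lambda_i \operatorname{Tr}\!\left(H_i L_i H_i^\top\right)$$
Subject to: $H_i \ge 0$ for all $i = 1, \dots, m$.
- Reconstruction Loss: $\tfrac{1}{2}\| X - Z_1 Z_2 \cdots Z_m H_m \|_F^2$ enforces a compact encoding for $X$.
- Graph Laplacian Regularization: Each $L_i$ is a Laplacian derived from partial labels or attribute similarity graphs at layer $i$; it penalizes divergence of low-dimensional codes $H_i$ for samples known (or presumed) to share an attribute.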
If a nonlinearity $g$ is used, the reconstruction loss generalizes accordingly. The regularization parameter $\lambda_i$ controls the influence of supervision at each layer and is typically selected from a prescribed range.
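To make the two terms of the criterion concrete, here is a small sketch that evaluates a reconstruction loss plus per-layer Laplacian penalties for the linear model. It assumes codes of shape (k_i, n), Laplacians of shape (n, n), and a factor of 1/2 on both terms; the function name `wsf_objective` and the toy matrices are hypothetical.

```python
import numpy as np

def wsf_objective(X, Zs, Hs, Ls, lams):
    """Reconstruction loss plus per-layer Laplacian penalties (linear model)."""
    recon = Hs[-1]
    for Z in reversed(Zs):          # Z_1 @ ... @ Z_m @ H_m
        recon = Z @ recon
    value = 0.5 * np.linalg.norm(X - recon, "fro") ** 2
    for H, L, lam in zip(Hs, Ls, lams):
        value += 0.5 * lam * np.trace(H @ L @ H.T)
    return value

# Toy example: 3 samples, one layer, samples 0 and 1 are linked.
Z1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])   # 3 x 2 basis (mixed signs)
H1 = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])      # 2 x 3 nonnegative code
X = Z1 @ H1                                            # exact reconstruction
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
L1 = np.diag(A.sum(axis=1)) - A                        # Laplacian D - A

obj = wsf_objective(X, [Z1], [H1], [L1], lams=[0.5])
print(obj)  # 0.5
```

Because the reconstruction is exact, the whole value comes from the penalty: the linked codes (1, 0) and (2, 1) are at squared distance 2, so the result is 0.5 · 0.5 · 2 = 0.5.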
3. Weak Supervision via Attribute Graphs
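At each layer, weak supervision is achieved by constructing an adjacency graph $A_i$ reflecting known partial labels for the relevant attribute. The Laplacian is then $L_i = D_i - A_i$, where $D_i$ is the corresponding degree matrix.
- $(A_i)_{jk} = 1$ if samples $j$ and $k$ share a known label for the supervised attribute at layer $i$, and $(A_i)_{jk} = 0$ otherwise.
- The penalty $\operatorname{Tr}(H_i L_i H_i^\top)$ becomes the sum $\tfrac{1}{2}\sum_{j,k} (A_i)_{jk}\,\| h_j^{(i)} - h_k^{(i)} \|_2^2$ over sample pairs, where $h_j^{(i)}$ is the $j$-th column of $H_i$, encouraging codes to cluster for must-linked items.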
In datasets with multiple known attributes (e.g., identity, pose, expression), Deep WSF can be configured so that each layer encodes a representation specialized for one attribute, with separate graphs and Laplacians at each level.
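The graph construction can be sketched as follows. The example assumes integer labels with `-1` marking an unlabeled sample (a convention chosen here, not taken from the cited work), and it numerically checks that the trace penalty equals the pairwise sum of squared code distances over linked samples. The helper name `label_laplacian` is hypothetical.

```python
import numpy as np

def label_laplacian(labels):
    """Build a 0/1 adjacency and its Laplacian from partial labels.

    labels: integer array of length n; -1 marks an unlabeled sample
    (an assumed convention for this sketch).
    """
    labels = np.asarray(labels)
    known = labels >= 0
    same = labels[:, None] == labels[None, :]
    # Link two samples only if both labels are known and equal.
    A = (same & known[:, None] & known[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)   # self-loops cancel in D - A, so drop them
    D = np.diag(A.sum(axis=1))
    return D - A, A

labels = [0, 0, 1, -1, 1]      # sample 3 is unlabeled
L, A = label_laplacian(labels)

rng = np.random.default_rng(0)
H = rng.random((3, len(labels)))           # codes: one column per sample

penalty = np.trace(H @ L @ H.T)
pairwise = 0.5 * sum(
    A[j, k] * np.sum((H[:, j] - H[:, k]) ** 2)
    for j in range(len(labels))
    for k in range(len(labels))
)
print(np.isclose(penalty, pairwise))  # True
```

For several attributes, one would call the helper once per attribute and attach each resulting Laplacian to the layer meant to encode that attribute.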
4. Optimization Algorithm and Practicalities
Training proceeds in two stages:
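- Greedy Layerwise Pre-training: For each layer $i$, optimize the single-layer WSF subproblem:
$$\min_{Z_i,\; H_i \ge 0} \; \frac{1}{2}\left\| H_{i-1} - Z_i H_i \right\|_F^2 + \frac{\lambda_i}{2} \operatorname{Tr}\!\left(H_i L_i H_i^\top\right), \qquad H_0 = X,$$
using multiplicative updates for $H_i$ and least-squares or pseudo-inverse for $Z_i$.
- Global Fine-tuning: Alternately update all $(Z_i, H_i)$ by:
  - $Z_i \leftarrow$ least-squares solution from reconstructed code.
  - $H_i$ via component-wise multiplicative update:
$$H_i \leftarrow H_i \odot \sqrt{\frac{[\Psi^\top X]^{\mathrm{pos}} + [\Psi^\top \Psi]^{\mathrm{neg}} H_i + \lambda_i H_i A_i}{[\Psi^\top X]^{\mathrm{neg}} + [\Psi^\top \Psi]^{\mathrm{pos}} H_i + \lambda_i H_i D_i}}$$
where $\Psi = Z_1 Z_2 \cdots Z_i$, $[M]^{\mathrm{pos}} = (|M| + M)/2$, and $[M]^{\mathrm{neg}} = (|M| - M)/2$. The square root and the division are taken elementwise, and $A_i$, $D_i$ are the adjacency and degree matrices of the layer-$i$ graph.
  - Optionally, renormalize to keep the factors bounded.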
The stopping criterion is typically a small relative objective change or a fixed iteration count; settings of 500–1,000 iterations are reported. Initialization is typically via SVD-based heuristics (NNDSVD or Gillis–Glineur).
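The pre-training stage can be sketched in NumPy as below: a single-layer solver alternating a pseudo-inverse step for the basis with a multiplicative step for the code, then a greedy loop that feeds each layer's code to the next. This is a sketch under stated assumptions rather than the authors' implementation: it uses random initialization instead of an SVD-based heuristic, a fixed iteration count, a small `eps` for numerical safety, and it omits global fine-tuning. The names `wsf_layer` and `pretrain` are hypothetical.

```python
import numpy as np

def pos(M):
    return (np.abs(M) + M) / 2.0   # elementwise positive part

def neg(M):
    return (np.abs(M) - M) / 2.0   # elementwise negative part (as a magnitude)

def wsf_layer(X, k, A, lam=0.1, n_iter=200, eps=1e-9, seed=0):
    """Single-layer sketch: X ~ Z @ H with H >= 0 and a Laplacian penalty on H."""
    rng = np.random.default_rng(seed)
    D = np.diag(A.sum(axis=1))
    L = D - A
    H = rng.random((k, X.shape[1])) + 0.1   # random init (assumption, not NNDSVD)
    history = []
    for _ in range(n_iter):
        Z = X @ np.linalg.pinv(H)           # least-squares basis update
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        num = pos(ZtX) + neg(ZtZ) @ H + lam * (H @ A)
        den = neg(ZtX) + pos(ZtZ) @ H + lam * (H @ D)
        H = H * np.sqrt(num / (den + eps))  # multiplicative step keeps H >= 0
        obj = (0.5 * np.linalg.norm(X - Z @ H, "fro") ** 2
               + 0.5 * lam * np.trace(H @ L @ H.T))
        history.append(obj)
    return Z, H, history

def pretrain(X, sizes, graphs, lams):
    """Greedy layerwise pre-training: each layer factorizes the previous code."""
    Zs, Hs, inp = [], [], X
    for k, A, lam in zip(sizes, graphs, lams):
        Z, H, _ = wsf_layer(inp, k, A, lam)
        Zs.append(Z)
        Hs.append(H)
        inp = H
    return Zs, Hs

# Synthetic demo data (hypothetical sizes and labels).
rng = np.random.default_rng(1)
X = rng.standard_normal((12, 30))
labels = rng.integers(0, 3, size=30)
A = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(A, 0.0)

Z, H, history = wsf_layer(X, k=4, A=A, lam=0.1)
Zs, Hs = pretrain(X, sizes=[6, 3], graphs=[A, A], lams=[0.1, 0.1])
```

Tracking `history` makes it easy to swap the fixed iteration count for a relative-change stopping rule.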
5. Empirical Evaluation and Attribute Decoupling
In experimental settings on face datasets (XM2VTS, CMU-PIE, CMU-Multi-PIE), Deep WSF demonstrates statistically significant improvements in clustering accuracy (AC) and classification against Semi-NMF and alternative nonnegative or semi-supervised matrix factorization methods.
- For example, on XM2VTS (final layer dimension 40), Semi-NMF achieves AC $\approx$ 0.61, while Deep Semi-NMF yields AC $\approx$ 0.68. Using Image Gradient Orientation features, Deep models reach AC $\approx$ 0.77 (vs. 0.63 for Semi-NMF) (Trigeorgis et al., 2015).
- In the three-attribute classification on CMU-Multi-PIE, Deep WSF learns layer representations $H_1$, $H_2$, $H_3$ optimized for pose, expression, and identity, outperforming all previous semi-supervised NMF variants on identity classification by about 10%, with attribute-specific accuracies of 100%, 82.9%, and 65.2%, respectively.
- Supervised pre-training on one dataset can transfer beneficially to another, as shown by AC improvements from 0.56 to 0.62 on CMU-PIE after pre-training on XM2VTS.
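The text does not spell out how AC is computed. A common definition, assumed here, is the fraction of samples labeled correctly under the best one-to-one mapping from cluster ids to class labels. The sketch below brute-forces that mapping, which is only practical for a handful of clusters; the function name `clustering_accuracy` and the toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """AC under the best one-to-one cluster-to-class mapping (brute force)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0
    # Try every injective assignment of clusters to classes.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits)
    return best / len(true_labels)

true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [1, 1, 0, 0, 2, 0]   # cluster names are arbitrary
ac = clustering_accuracy(true_labels, cluster_ids)
print(round(ac, 3))  # 0.833
```

With many clusters, the same matching is usually solved with the Hungarian algorithm instead of enumeration.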
6. Computational Complexity and Implementation Guidelines
Reported computational complexity for Deep WSF (in the linear model) is:
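- Pre-training: $O\!\left(m\,t\,(pnk + nk^2 + kp^2 + kn^2)\right)$
- Fine-tuning: $O\!\left(m\,t_f\,(pnk + (p+n)k^2)\right)$

where $k = \max_i k_i$, and $t$, $t_f$ are the numbers of iterations for pre-training and fine-tuning, respectively.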
Key guidelines:
- Number of layers $m$ is typically set to 2 or 3.
- Hidden sizes $k_i$ are dataset- and attribute-dependent; for instance, the XM2VTS clustering experiment uses a final-layer dimension of 40.
- Regularization parameters $\lambda_i$ are tuned via validation, within a recommended range for partial supervision.
- Careful initialization is critical for convergence and stability.
7. Significance and Distinct Features
Deep WSF enables learning of deep, layered representations that are explicitly aligned with weak, attribute-level supervision, providing a principled methodology for capturing both global and attribute-specific structure in complex datasets (Trigeorgis et al., 2015). Its layerwise Laplacian regularization fosters disentanglement along known axes of variability, in contrast to flat NMF methods. Empirical results show superior clustering and classification, with robustness to mixed or partial labels. Deep WSF also supports multi-attribute learning, yielding layerwise representations specialized for each attribute, a capability not present in conventional shallow factorization frameworks.
A plausible implication is that Deep WSF serves as a foundation for future research on deep factorizations with multi-attribute or graph-based weak supervision, and it continues to inform more recent deep semi-NMF frameworks that employ more advanced prior or label constraints (Zhang et al., 2020; Trigeorgis et al., 2015).