Deep WSF: Weakly-Supervised Deep Semi-NMF

Updated 20 April 2026

The paper demonstrates that integrating weak supervision into deep semi-NMF frameworks enhances representation learning and improves clustering and classification accuracy.
The methodology involves multi-layer factorization with graph Laplacian regularization, enabling hierarchical learning and encoding of partial labels.
Empirical results show significant gains over standard Semi-NMF, validating its effectiveness on multi-attribute datasets such as face image collections.

Weakly-Supervised Deep Semi-NMF (Deep WSF) is a multi-layer matrix factorization framework designed to learn hierarchical and attribute-specific representations from partially labeled data. This paradigm builds on the classical Semi-Nonnegative Matrix Factorization (Semi-NMF) and incorporates partial prior information using weak supervision, extending factorization into multiple nonnegative layers. Deep WSF simultaneously enables unsupervised learning of hidden semantic features and explicit encoding of available label information, supporting both clustering and classification across complex, multi-attribute datasets (Trigeorgis et al., 2015).

1. Model Formulation and Layered Factorization

At its core, Deep WSF operates on a data matrix $X \in \mathbb{R}^{p \times n}$, where $p$ is the feature dimension and $n$ is the number of samples. The factorization proceeds through $m$ layers:

$X \approx Z_1 Z_2 \cdots Z_m H_m$

$Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$: linear transformation (basis) matrices at layer $i$, with $k_0 = p$ and $k_m$ the dimension of the final code.
$H_i \in \mathbb{R}^{k_i \times n}$: activation (code) matrices; a nonnegativity constraint $H_i \geq 0$ is imposed at all layers.

Each intermediate layer reconstructs its input as $H_{i-1} \approx Z_i H_i$ (with $H_0 = X$), giving a hierarchical structure analogous to deep networks. Optional entrywise nonlinearities $g(\cdot)$ can be inserted, making $H_{i-1} \approx g(Z_i H_i)$, but the principal model is linear in the cited work.
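The shapes involved in the layered product can be checked with a short NumPy sketch. The sizes, the random initial values, and the helper name `reconstruct` are illustrative assumptions, not values from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n = 30, 50            # feature dimension and number of samples (illustrative)
layer_sizes = [20, 10]   # k_1, k_2 (illustrative); k_0 = p

X = rng.standard_normal((p, n))

# Z_i has shape (k_{i-1}, k_i); its entries may have mixed sign.
dims = [p] + layer_sizes
Z = [rng.standard_normal((dims[i], dims[i + 1])) for i in range(len(layer_sizes))]

# Final code H_m has shape (k_m, n) and is nonnegative.
H_m = np.abs(rng.standard_normal((layer_sizes[-1], n)))


def reconstruct(Z, H_m):
    """Return Z_1 Z_2 ... Z_m H_m, multiplying from the innermost layer outwards."""
    out = H_m
    for Z_i in reversed(Z):
        out = Z_i @ out
    return out


X_hat = reconstruct(Z, H_m)
print(X_hat.shape)  # same shape as X
```

Multiplying from the innermost layer outwards keeps every intermediate product at most $k_{i-1} \times n$, which is cheaper than first forming the full product of the $Z_i$ when $n$ is small relative to $p$.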

For weak supervision, Deep WSF is equipped to integrate partial label information on one or more attributes at each layer, modeled via graph Laplacian regularization (Trigeorgis et al., 2015).

2. Objective Function and Regularization

The full optimization criterion is

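$\min_{Z_i, H_i} \; \frac{1}{2} \| X - Z_1 Z_2 \cdots Z_m H_m \|_F^2 + \frac{1}{2} \sum_{i=1}^{m} \lambda_i \, \mathrm{Tr}(H_i L_i H_i^\top)$

Subject to: $H_i \geq 0$ for all $i = 1, \dots, m$.

Reconstruction Loss: $\frac{1}{2} \| X - Z_1 Z_2 \cdots Z_m H_m \|_F^2$ enforces a compact encoding for $X$.
Graph Laplacian Regularization: Each $L_i$ is a Laplacian derived from partial labels or attribute similarity graphs at layer $i$; it penalizes divergence of low-dimensional codes $H_i$ for samples known (or presumed) to share an attribute.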

If a nonlinearity $g$ is used, the reconstruction loss generalizes accordingly. The regularization parameter $\lambda_i$ controls the influence of supervision at each layer and is typically selected from a predefined range of candidate values.
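As a minimal sketch, the criterion can be evaluated directly for the linear model. It assumes the objective takes the form $\frac{1}{2}\|X - Z_1 \cdots Z_m H_m\|_F^2 + \frac{1}{2}\sum_i \lambda_i \mathrm{Tr}(H_i L_i H_i^\top)$; the function name `deep_wsf_objective` and the toy data are hypothetical.

```python
import numpy as np


def deep_wsf_objective(X, Z, H, L, lam):
    """Evaluate the assumed linear Deep WSF objective.

    Z:   list of basis matrices Z_1..Z_m
    H:   list of code matrices H_1..H_m (H[-1] is the final code H_m)
    L:   list of n x n graph Laplacians, one per layer
    lam: list of regularization weights, one per layer
    """
    recon = H[-1]
    for Z_i in reversed(Z):
        recon = Z_i @ recon
    loss = 0.5 * np.linalg.norm(X - recon, "fro") ** 2
    reg = 0.5 * sum(
        l_i * np.trace(H_i @ L_i @ H_i.T) for l_i, H_i, L_i in zip(lam, H, L)
    )
    return loss + reg


# Toy example (assumed sizes): X is reconstructed exactly, so only the
# Laplacian term contributes to the objective.
rng = np.random.default_rng(0)
n = 5
Z = [rng.standard_normal((6, 4)), rng.standard_normal((4, 2))]
H = [np.abs(rng.standard_normal((4, n))), np.abs(rng.standard_normal((2, n)))]
X = Z[0] @ Z[1] @ H[1]
W = np.ones((n, n)) - np.eye(n)        # every pair of samples linked
L = np.diag(W.sum(axis=1)) - W
value = deep_wsf_objective(X, Z, H, [L, L], lam=[0.1, 0.1])
print(value)
```

Setting every entry of `lam` to zero recovers the plain deep semi-NMF reconstruction loss.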

3. Weak Supervision via Attribute Graphs

At each layer, weak supervision is achieved by constructing an adjacency graph $W_i$ reflecting known partial labels for the relevant attribute. The Laplacian is then $L_i = D_i - W_i$, where $D_i$ is the corresponding degree matrix.

$(W_i)_{jk} = 1$ if samples $j$ and $k$ share a known label for the supervised attribute at layer $i$, and $(W_i)_{jk} = 0$ otherwise.
The penalty $\mathrm{Tr}(H_i L_i H_i^\top)$ becomes a sum over $(W_i)_{jk} \, \| h^{(i)}_j - h^{(i)}_k \|_2^2$, where $h^{(i)}_j$ denotes the $j$-th column of $H_i$, encouraging codes to cluster for must-linked items.

In datasets with multiple known attributes (e.g., identity, pose, expression), Deep WSF can be configured so that each layer encodes a representation specialized for one attribute, with separate graphs and Laplacians at each level.
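The graph construction for one attribute can be sketched as follows, assuming binary edge weights and the convention that a label of -1 marks an unlabeled sample (both are assumptions made for illustration, as is the helper name `label_laplacian`). The snippet also checks numerically that the trace penalty equals half the weighted sum of squared pairwise code distances.

```python
import numpy as np


def label_laplacian(labels):
    """Build adjacency W, degree D, and Laplacian L = D - W from partial labels.

    labels: 1-D integer array; -1 marks a sample with no known label.
    Two samples are linked only if both are labeled and the labels agree.
    """
    labels = np.asarray(labels)
    known = labels >= 0
    same = labels[:, None] == labels[None, :]
    W = (same & known[:, None] & known[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)           # no self-loops
    D = np.diag(W.sum(axis=1))
    return W, D, D - W


labels = np.array([0, 0, 1, -1, 1, 0])  # sample 3 is unlabeled
W, D, L = label_laplacian(labels)

rng = np.random.default_rng(0)
H = np.abs(rng.standard_normal((3, labels.size)))   # codes, one column per sample

trace_form = np.trace(H @ L @ H.T)
diffs = H[:, :, None] - H[:, None, :]                # diffs[:, j, k] = h_j - h_k
pairwise_form = 0.5 * np.sum(W * np.sum(diffs ** 2, axis=0))
print(np.isclose(trace_form, pairwise_form))
```

For a multi-attribute dataset, one such Laplacian would be built per supervised attribute and attached to the layer meant to encode it.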

4. Optimization Algorithm and Practicalities

Training proceeds in two stages:

Greedy Layerwise Pre-training: For each layer $i = 1, \dots, m$, optimize the single-layer WSF subproblem:

$\min_{Z_i, \, H_i \geq 0} \; \frac{1}{2} \| H_{i-1} - Z_i H_i \|_F^2 + \frac{\lambda_i}{2} \mathrm{Tr}(H_i L_i H_i^\top), \quad H_0 = X$

using multiplicative updates for $H_i$ and least-squares or pseudo-inverse for $Z_i$.

Global Fine-tuning: Alternately update all $\{Z_i, H_i\}_{i=1}^{m}$ by:
- $Z_i$: least-squares solution from reconstructed code.
- $H_i$: via component-wise multiplicative update:
$H_i \leftarrow H_i \odot \sqrt{\dfrac{[\Psi_i^\top X]^{+} + [\Psi_i^\top \Psi_i]^{-} H_i + \lambda_i H_i W_i}{[\Psi_i^\top X]^{-} + [\Psi_i^\top \Psi_i]^{+} H_i + \lambda_i H_i D_i}}$

where $\Psi_i = Z_1 Z_2 \cdots Z_i$, $[A]^{+} = (|A| + A)/2$, and $[A]^{-} = (|A| - A)/2$; $W_i$ and $D_i$ are the adjacency and degree matrices with $L_i = D_i - W_i$.
- Optionally, renormalize to keep the factors bounded.
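The pre-training stage can be sketched in NumPy under stated assumptions: the single-layer subproblem is $\frac{1}{2}\|X - ZH\|_F^2 + \frac{\lambda}{2}\mathrm{Tr}(HLH^\top)$ with $L = D - W$, $Z$ is refit by pseudo-inverse, and $H$ follows a semi-NMF style multiplicative rule with the graph terms added. Random initialization, the small `eps` guard, the tolerance, and the names `wsf_layer` and `pretrain` are illustrative choices, not the cited implementation.

```python
import numpy as np


def pos(A):
    """Positive part: (|A| + A) / 2."""
    return (np.abs(A) + A) / 2.0


def neg(A):
    """Negative part: (|A| - A) / 2."""
    return (np.abs(A) - A) / 2.0


def wsf_layer(X, W, k, lam=0.1, max_iter=500, tol=1e-6, eps=1e-9, seed=0):
    """One weakly supervised semi-NMF layer: X ~ Z H with H >= 0."""
    rng = np.random.default_rng(seed)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Random nonnegative start (assumption; SVD-based schemes are an alternative).
    H = np.abs(rng.standard_normal((k, X.shape[1]))) + 0.1
    history = []
    for _ in range(max_iter):
        Z = X @ np.linalg.pinv(H)                      # least-squares basis
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        num = pos(ZtX) + neg(ZtZ) @ H + lam * (H @ W)
        den = neg(ZtX) + pos(ZtZ) @ H + lam * (H @ D)
        H = H * np.sqrt(num / (den + eps))             # stays nonnegative
        obj = 0.5 * np.linalg.norm(X - Z @ H) ** 2 + 0.5 * lam * np.trace(H @ L @ H.T)
        converged = bool(history) and abs(history[-1] - obj) <= tol * max(history[-1], eps)
        history.append(obj)
        if converged:                                  # small relative change
            break
    return Z, H, history


def pretrain(X, graphs, sizes, lams):
    """Greedy stacking: layer i factorizes the previous code H_{i-1} (H_0 = X)."""
    Zs, Hs, inp = [], [], X
    for W, k, lam in zip(graphs, sizes, lams):
        Z, H, _ = wsf_layer(inp, W, k, lam)
        Zs.append(Z)
        Hs.append(H)
        inp = H
    return Zs, Hs


# Toy run on random data with fully known labels for one attribute.
rng = np.random.default_rng(1)
X = rng.standard_normal((12, 40))
labels = rng.integers(0, 3, size=40)
W = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(W, 0.0)

Zs, Hs = pretrain(X, [W, W], sizes=[8, 4], lams=[0.1, 0.1])
_, _, hist = wsf_layer(X, W, 8)
print(Hs[0].shape, Hs[1].shape, hist[0], hist[-1])
```

Fine-tuning would reuse the same multiplicative rule with the product of the first $i$ basis matrices in place of `Z` and the original data in place of the layer input.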

The stopping criterion is typically a small relative objective change or a fixed iteration count; settings of 500–1,000 iterations are reported. Initialization is typically via SVD-based heuristics (NNDSVD or Gillis–Glineur).

5. Empirical Evaluation and Attribute Decoupling

In experiments on face datasets (XM2VTS, CMU-PIE, CMU-Multi-PIE), Deep WSF demonstrates statistically significant improvements in clustering accuracy (AC) and classification accuracy over Semi-NMF and alternative nonnegative or semi-supervised matrix factorization methods.

For example, on XM2VTS (final layer dimension 40), Semi-NMF achieves AC $\approx$ 0.61, while Deep Semi-NMF yields AC $\approx$ 0.68. Using Image Gradient Orientation features, Deep models reach AC $\approx$ 0.77 (vs. Semi-NMF 0.63) (Trigeorgis et al., 2015).
In the three-attribute classification on CMU-Multi-PIE, Deep WSF learns $H_1, H_2, H_3$ optimized for pose, expression, and identity, outperforming all previous semi-supervised NMF variants on identity classification by roughly 10%, with attribute-specific accuracies of 100%, 82.9%, and 65.2%, respectively.
Supervised pre-training on one dataset can transfer beneficially to another, as shown by AC improvements from 0.56 to 0.62 on CMU-PIE after pre-training on XM2VTS.
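The text does not define AC; a common definition, assumed here, is the fraction of samples correctly labeled under the best one-to-one mapping from predicted clusters to true classes. The sketch below searches all mappings by brute force, which is only practical for a handful of clusters (an assignment solver such as SciPy's `linear_sum_assignment` is the usual choice at larger scale). The function name and the toy labels are hypothetical.

```python
from itertools import permutations

import numpy as np


def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between cluster ids and class labels (brute force)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0.0
    # Try every injective assignment of clusters to classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        mapped = np.array([mapping[c] for c in y_pred])
        best = max(best, float(np.mean(mapped == y_true)))
    return best


y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [2, 2, 0, 0, 0, 1, 1, 1]   # cluster ids are arbitrary
acc = clustering_accuracy(y_true, y_pred)
print(acc)  # 0.875: seven of eight samples match under the best mapping
```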

6. Computational Complexity and Implementation Guidelines

Reported computational complexity for Deep WSF (in the linear model) is:

Pre-training: $Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$ 2
Fine-tuning: $Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$ 3 where $Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$ 4, and $Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$ 5, $Z_i \in \mathbb{R}^{k_{i-1} \times k_i}$ 6 are the number of iterations for pre-training and fine-tuning, respectively.

Key guidelines:

Number of layers $m$ is typically set to 2 or 3.
Hidden sizes $k_i$ are dataset- and attribute-dependent, with sizes ranging up to 70 among the cited examples.
Regularization parameters $\lambda_i$ are tuned via validation within a recommended range for partial supervision.
Careful initialization is critical for convergence and stability.

7. Significance and Distinct Features

Deep WSF enables learning of deep, layered representations that are explicitly aligned with weak, attribute-level supervision, providing a principled methodology for capturing both global and attribute-specific structure in complex datasets (Trigeorgis et al., 2015). Its layerwise Laplacian regularization fosters disentanglement along known axes of variability, in contrast to flat NMF methods. Empirical results show superior clustering and classification, with robustness to mixed or partial labels. Deep WSF also supports multi-attribute learning, yielding layerwise representations specialized for each attribute, a capability not present in conventional shallow factorization frameworks.

A plausible implication is that Deep WSF serves as a foundation for future research on deep factorizations with multi-attribute or graph-based weak supervision; it continues to inform later deep semi-NMF frameworks that employ richer prior or label constraints (Zhang et al., 2020; Trigeorgis et al., 2015).


References

Trigeorgis et al. (2015). A deep matrix factorization method for learning attribute representations.

Zhang et al. (2020). Dual-constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior.
