
Self-supervised Dimension Reduction

Updated 30 January 2026
  • Self-supervised dimension reduction is a technique that projects high-dimensional data into lower-dimensional spaces using reconstruction objectives derived directly from the feature matrix.
  • It leverages rank-k singular value decomposition to achieve optimal low-rank approximations, reducing computational complexity and mitigating overfitting in polynomial models.
  • Efficient optimization via conjugate gradient methods integrates the self-supervised embedding with downstream models, ensuring scalability and robust performance.

Self-supervised dimension reduction is a class of techniques designed to project data into lower-dimensional spaces without using labeled outcomes, relying instead on reconstructive or structure-preserving objectives derived directly from the feature matrix. Recent research has focused on scalable frameworks that combine linear algebraic methods with self-supervised objectives, demonstrating strong performance in domains where traditional dimensionality reduction methods may struggle due to the curse of dimensionality, overfitting, and computational inefficiency (Song et al., 18 Jan 2025).

1. Mathematical Formulation of Self-supervised Dimension Reduction

Given a training feature matrix $X \in \mathbb{R}^{m \times n}$, where each row $x_p$ is an $n$-dimensional input vector, the self-supervised dimensionality reduction task seeks a rank-$k$ linear transformation $T \in \mathbb{R}^{n \times n}$ (with $k < n$) that best reconstructs $X$ from its projection. The canonical objective minimizes the Frobenius-norm reconstruction error:

$$\min_{T \in \mathbb{R}^{n \times n}} \|XT - X\|_F^2 \quad \text{s.t.} \quad \mathrm{rank}(T) = k.$$

Here, $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$ denotes the squared Frobenius norm. The absence of any supervised signal (i.e., labels $y$) in the embedding learning renders this approach strictly self-supervised. An optional regularization term $\frac{\lambda}{2}\|T\|_F^2$ may be included to control the magnitude of the transformation, but was omitted in the core HOPS experiments ($\lambda = 0$).
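As a concrete illustration, the objective can be evaluated numerically with NumPy. This is a minimal sketch: the data, the random rank-$k$ candidate projector, and the function name `reconstruction_loss` are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 12, 4
X = rng.standard_normal((m, n))

def reconstruction_loss(X, T, lam=0.0):
    """Frobenius objective ||X T - X||_F^2 + (lam/2) ||T||_F^2."""
    return (np.linalg.norm(X @ T - X, "fro") ** 2
            + 0.5 * lam * np.linalg.norm(T, "fro") ** 2)

# Any rank-k projection matrix is a feasible candidate; here a random one
# built from an orthonormal n x k basis Q, so T = Q Q^T has rank k.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
T_rand = Q @ Q.T
loss = reconstruction_loss(X, T_rand)
print(loss)
```

A random projector gives a feasible but generally suboptimal loss; Section 2 shows the closed-form optimum.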

2. Low-rank Embedding and Interaction with Polynomial Models

The optimal solution to the stated objective is given by the Eckart–Young–Mirsky theorem, which prescribes rank-$k$ approximation via the singular value decomposition (SVD):

$$X = U \Sigma V^T,$$

where $V_k = [v_1, \ldots, v_k]$ collects the top-$k$ right singular vectors. The optimal transformation is $T^* = V_k V_k^T$, yielding a reduced feature matrix $\tilde{X} = X V_k$.
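A short NumPy check of the closed-form optimum (on synthetic data): by Eckart–Young, the loss attained by $T^* = V_k V_k^T$ equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 12, 4
X = rng.standard_normal((m, n))

# Top-k right singular vectors of X.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T              # n x k
T_star = Vk @ Vk.T         # optimal rank-k transformation
X_tilde = X @ Vk           # reduced m x k feature matrix

opt_loss = np.linalg.norm(X @ T_star - X, "fro") ** 2
# Eckart-Young: optimal loss = sum of the discarded squared singular values.
print(np.isclose(opt_loss, np.sum(S[k:] ** 2)))
```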

In the context of high-order polynomial modeling, each original input $x$ is mapped to $\tilde{x} = x V_k \in \mathbb{R}^k$. For a $d$-order polynomial model,

$$f(x) = W_0 + \sum_{i=1}^d \sum_{j_1, \ldots, j_i = 1}^n (W_i)_{j_1 \cdots j_i}\, x_{j_1} \cdots x_{j_i},$$

the parameter count scales as $\mathcal{O}(n^d)$. Embedding reduces this to an equivalent $k$-dimensional polynomial

$$\tilde{f}(x) = W_0 + \sum_{i=1}^d \sum_{a_1, \ldots, a_i = 1}^k (W'_i)_{a_1 \cdots a_i}\, (\tilde{x})_{a_1} \cdots (\tilde{x})_{a_i}$$

with $\mathcal{O}(k^d)$ parameters per $W'_i$, dramatically reducing complexity and mitigating overfitting (Song et al., 18 Jan 2025).
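The parameter savings can be tallied directly; the sizes $n = 100$, $k = 10$, $d = 3$ below are illustrative, not the paper's settings.

```python
# A dense order-i weight tensor W_i has n**i entries, so a degree-d model
# carries sum_{i=1..d} n**i parameters; embedding shrinks n to k.
n, k, d = 100, 10, 3
full = sum(n ** i for i in range(1, d + 1))
reduced = sum(k ** i for i in range(1, d + 1))
print(full, reduced)  # 1010100 vs 1110: roughly a 1000x reduction
```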

3. Computational Optimization via Conjugate Gradient

Fitting polynomial models on the embedded features leads to a large-scale linear least-squares problem. Rather than solving explicitly via matrix inversion, HOPS employs a Fletcher–Reeves-style conjugate gradient algorithm (PolyCG). In each iteration, the weights $W_i$ are updated by:

$$W_i^{(j+1)} = W_i^{(j)} + \alpha^{(j)} P_i^{(j)},$$

$$P_i^{(j+1)} = -R_i^{(j+1)} + \beta^{(j)} P_i^{(j)},$$

where $R_i^{(j)}$ is the gradient of the empirical loss with respect to $W_i$, and the step sizes $\alpha^{(j)}$ and $\beta^{(j)}$ are computed as ratios of gradient norms. Convergence is assessed via the relative change in loss, and each iteration incurs a computational cost of $\mathcal{O}(m \sum_{i=1}^d k_i^i)$ for function evaluation and gradient computation. In exact arithmetic, CG converges in at most $\sum_i k_i^i$ steps, but typically reaches a solution in far fewer iterations.
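The update rules above can be sketched for the underlying least-squares problem. This is a simplified Fletcher–Reeves-style CG on a flattened design matrix, under the hypothetical name `poly_cg`; the paper's PolyCG operates on the stacked weight tensors.

```python
import numpy as np

def poly_cg(A, y, iters=100, tol=1e-10):
    """Fletcher-Reeves-style CG sketch for min_w ||A w - y||^2.
    A: stacked polynomial features, y: targets."""
    w = np.zeros(A.shape[1])
    r = A.T @ (A @ w - y)        # gradient R of 0.5 * ||A w - y||^2
    p = -r                       # initial search direction P
    rr = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rr / (Ap @ Ap)   # exact line search for the quadratic loss
        w = w + alpha * p        # W update
        r = A.T @ (A @ w - y)
        rr_new = r @ r
        if rr_new < tol:
            break
        beta = rr_new / rr       # Fletcher-Reeves ratio of gradient norms
        p = -r + beta * p        # P update
        rr = rr_new
    return w

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 5))
w_true = rng.standard_normal(5)
y = A @ w_true                   # consistent system, exact solution exists
w = poly_cg(A, y)
print(np.allclose(w, w_true, atol=1e-5))
```

For this 5-parameter quadratic, CG converges in at most 5 exact-arithmetic steps, matching the $\sum_i k_i^i$ bound stated above.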

4. Feature Construction and Downstream Integration

Feature construction proceeds by:

  • Computing the embedding matrix $L = V_k$ from the top-$k$ right singular vectors of $X$.
  • Mapping each original sample $x$ to the compressed $\tilde{x} = x L$.
  • Expanding $\tilde{x}$ into polynomial features up to degree $d$.
  • Fitting $\tilde{f}(\tilde{x})$ via PolyCG to optimize prediction on $(\tilde{x}, y)$ pairs.

The embedding is learned with no label information; only subsequent polynomial fitting employs supervised learning. This single-pass embedding ensures seamless integration with downstream supervised models and reduces the risk of information bottleneck arising from two-stage processes.
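The four steps can be sketched end to end on synthetic data. Here `poly_features` is an illustrative helper, and `np.linalg.lstsq` stands in for the PolyCG solver in the final supervised step.

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(3)
m, n, k, d = 300, 20, 5, 2
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)       # placeholder targets for the supervised step

# 1) Label-free embedding: top-k right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
L = Vt[:k].T
# 2) Compress each sample.
X_tilde = X @ L                  # m x k

# 3) Expand into all monomials of the compressed features up to degree d.
def poly_features(Z, d):
    cols = [np.ones(Z.shape[0])]                      # bias term
    for i in range(1, d + 1):
        for idx in combinations_with_replacement(range(Z.shape[1]), i):
            cols.append(np.prod(Z[:, idx], axis=1))
    return np.column_stack(cols)

A = poly_features(X_tilde, d)

# 4) Supervised fit on (x_tilde, y) pairs.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(A.shape)  # (300, 21): 1 bias + 5 linear + 15 quadratic monomials
```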

5. Theoretical Properties and Guarantees

The rank-$k$ SVD embedding provides the best possible low-rank approximation in Frobenius norm, as established by the Eckart–Young theorem. Preservation of principal information is guaranteed for the directions of maximal variance across the training set. The reduction in dimension from $n$ to $k$ translates into parameter savings for polynomial models, from $\Theta(n^d)$ to $\Theta(k^d)$, directly addressing overfitting and resource constraints. In addition, the conjugate gradient solver for least squares is guaranteed to converge in a number of steps not exceeding the parameter dimensionality, with robust empirical behavior in finite-precision arithmetic (Song et al., 18 Jan 2025).

6. Hyperparameter Selection and Practical Guidelines

The main hyperparameters are:

  • Polynomial order $d$: empirical evidence suggests $d \leq 3$ suffices for applications such as load forecasting.
  • Embedding ranks $\{k_i\}$: $k_1 = n$, with descending ranks for higher polynomial orders (e.g., $k_2$ and $k_3$ set via grid search or fixed at values such as $(60, 9)$).
  • Optional regularization $\lambda$ on $T$, or $L_2$ penalties on the $W_i$, can be tuned via cross-validation to manage the bias–variance tradeoff.
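As a complement to the grid search above, one common heuristic (not the paper's procedure) is to pick a rank that retains a fixed fraction of the squared spectral energy; the function name and threshold below are illustrative.

```python
import numpy as np

def rank_for_energy(X, frac=0.95):
    """Smallest k whose top-k singular values retain `frac` of the
    squared spectral energy of X. A heuristic alternative to grid search."""
    s = np.linalg.svd(X, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, frac) + 1)

rng = np.random.default_rng(4)
# Synthetic data with a decaying spectrum, so a modest k captures most energy.
X = rng.standard_normal((200, 30)) @ np.diag(np.linspace(3.0, 0.1, 30))
print(rank_for_energy(X))
```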

7. Comparison to PCA and Autoencoder Methods

If $X$ is z-score normalized, the top-$k$ SVD embedding coincides with principal component analysis (PCA); HOPS generalizes this by allowing SVD on unnormalized or min–max normalized inputs. Unlike autoencoders, the embedding admits a closed-form solution, with no need for iterative network training and no risk of vanishing gradients. The self-supervised nature of the embedding avoids two-stage learning mismatch, and compared with generic nonlinear manifold methods, the linear SVD projection is computationally efficient, numerically stable, and theoretically optimal in reconstruction error.
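The PCA equivalence is easy to verify numerically: on z-score-normalized data, the top-$k$ right singular vectors match the leading eigenvectors of the correlation matrix up to sign (synthetic data below).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 8)) * rng.uniform(0.5, 4.0, 8)

# z-score normalize, then take the SVD embedding.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
k = 3
Vk = Vt[:k].T

# PCA: leading eigenvectors of the correlation matrix (= covariance of Z).
C = Z.T @ Z / Z.shape[0]
evals, evecs = np.linalg.eigh(C)       # ascending eigenvalues
Pk = evecs[:, ::-1][:, :k]             # top-k principal directions

# The two bases agree up to column signs, so |Vk^T Pk| is the identity.
match = np.allclose(np.abs(Vk.T @ Pk), np.eye(k), atol=1e-6)
print(match)
```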

In summary, self-supervised dimension reduction as realized in the HOPS framework provides a principled, efficient approach for embedding high-dimensional data. Its integration with polynomial models enables scalable, robust learning for complex forecasting tasks, with strong theoretical underpinnings and practical performance advantages (Song et al., 18 Jan 2025).
