Self-supervised Dimension Reduction
- Self-supervised dimension reduction is a technique that projects high-dimensional data into lower-dimensional spaces using reconstruction objectives derived directly from the feature matrix.
- It leverages rank-k singular value decomposition to achieve optimal low-rank approximations, reducing computational complexity and mitigating overfitting in polynomial models.
- Efficient optimization via conjugate gradient methods integrates the self-supervised embedding with downstream models, ensuring scalability and robust performance.
Self-supervised dimension reduction is a class of techniques designed to project data into lower-dimensional spaces without using labeled outcomes, relying instead on reconstructive or structure-preserving objectives derived directly from the feature matrix. Recent research has focused on scalable frameworks that combine linear algebraic methods with self-supervised objectives, demonstrating strong performance in domains where traditional dimensionality reduction methods may struggle due to the curse of dimensionality, overfitting, and computational inefficiency (Song et al., 18 Jan 2025).
1. Mathematical Formulation of Self-supervised Dimension Reduction
Given a training feature matrix $X \in \mathbb{R}^{N \times d}$, where each row is a $d$-dimensional input vector, the self-supervised dimensionality reduction task seeks a rank-$k$ linear transformation $W \in \mathbb{R}^{d \times k}$ (with $k \ll d$) to best reconstruct $X$ from its projection. The canonical objective optimizes the Frobenius norm reconstruction:

$$\min_{W \in \mathbb{R}^{d \times k},\; W^{\top} W = I_k} \ \lVert X - X W W^{\top} \rVert_F^2$$
Here, $\lVert \cdot \rVert_F$ denotes the Frobenius norm. The absence of any supervised signal (i.e., labels $y$) in the embedding learning renders this approach strictly self-supervised. An optional regularization term $\lambda \lVert W \rVert_F^2$ may be included to control the magnitude of the transformation, but was omitted in core HOPS experiments ($\lambda = 0$).
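As a minimal sketch (not the HOPS code), the objective above can be evaluated directly in NumPy; `X`, `W`, and `reconstruction_loss` are illustrative names, and the orthonormal `W` is generated at random purely to exercise the function.

```python
import numpy as np

def reconstruction_loss(X: np.ndarray, W: np.ndarray) -> float:
    """Frobenius-norm reconstruction error ||X - X W W^T||_F^2 for an
    orthonormal rank-k transformation W (shape d x k)."""
    Z = X @ W            # project to k dimensions
    X_hat = Z @ W.T      # map back to the original d-dimensional space
    return float(np.linalg.norm(X - X_hat, "fro") ** 2)

# Toy example: N=200 samples, d=20 features, a random orthonormal W with k=5.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
W, _ = np.linalg.qr(rng.standard_normal((20, 5)))  # orthonormal columns
print(reconstruction_loss(X, W))
```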
2. Low-rank Embedding and Interaction with Polynomial Models
The optimal solution to the stated objective is provided by the Eckart–Young–Mirsky theorem, which prescribes rank-$k$ approximation via singular value decomposition (SVD):

$$X = U \Sigma V^{\top}, \qquad X_k = U_k \Sigma_k V_k^{\top},$$

where $V_k \in \mathbb{R}^{d \times k}$ collects the top-$k$ right singular vectors. The optimal transformation is $W^{\star} = V_k$, yielding a reduced feature matrix $Z = X V_k \in \mathbb{R}^{N \times k}$.
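A short sketch of how this closed-form solution could be computed with NumPy's SVD, assuming the formulation above; it also checks the Eckart–Young identity that the squared reconstruction error equals the sum of the squared discarded singular values. Sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 64))
k = 8

# Economy SVD: rows of Vt are the right singular vectors of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                 # optimal rank-k transformation (d x k)
Z = X @ V_k                    # reduced feature matrix (N x k)

# Eckart-Young: the squared Frobenius reconstruction error equals the
# sum of the squared discarded singular values.
err = np.linalg.norm(X - Z @ V_k.T, "fro") ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))   # True
```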
In the context of high-order polynomial modeling, each original input $x \in \mathbb{R}^{d}$ is mapped to $z = V_k^{\top} x \in \mathbb{R}^{k}$. For a $p$-order polynomial model,

$$\hat{y} = \sum_{|\alpha| \le p} w_{\alpha}\, x^{\alpha},$$

where $\alpha$ ranges over multi-indices of total degree at most $p$, the parameter count scales as $O(d^{p})$. Embedding reduces this to an equivalent $k$-dimensional polynomial

$$\hat{y} = \sum_{|\alpha| \le p} w_{\alpha}\, z^{\alpha},$$

with $O(k^{p})$ parameters, dramatically reducing complexity and mitigating overfitting (Song et al., 18 Jan 2025).
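To make the parameter-count argument concrete, the sketch below counts all monomials of total degree at most $p$ (including the constant term) for illustrative sizes of $d$, $k$, and $p$; these numbers are not taken from the paper, and the exact term structure used by HOPS may differ.

```python
from math import comb

def n_poly_params(dim: int, p: int) -> int:
    """Number of monomials of total degree <= p in `dim` variables,
    i.e. C(dim + p, p), including the constant term."""
    return comb(dim + p, p)

d, k, p = 100, 10, 3         # illustrative sizes, not values from the paper
print(n_poly_params(d, p))   # 176851 -> polynomial on the raw d-dim features
print(n_poly_params(k, p))   # 286    -> polynomial on the k-dim embedding
```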
3. Computational Optimization via Conjugate Gradient
Fitting polynomial models on the embedded features leads to a large-scale linear least squares problem. Rather than explicitly solving via matrix inversion, HOPS leverages a Fletcher–Reeves style Conjugate Gradient (PolyCG) algorithm. In each iteration, weights are updated by:

$$w_{t+1} = w_t + \alpha_t d_t, \qquad d_{t+1} = -g_{t+1} + \beta_t d_t, \qquad \beta_t = \frac{\lVert g_{t+1} \rVert^2}{\lVert g_t \rVert^2},$$

where $g_t$ is the gradient of the empirical loss with respect to the weights $w_t$, the step size $\alpha_t$ follows from an exact line search on the quadratic loss, and the conjugacy coefficient $\beta_t$ is a ratio of squared gradient norms. Convergence is assessed via the relative change in loss, and the approach incurs a per-iteration computational cost of $O(ND)$ for loss evaluation and gradient computation, where $D$ is the number of polynomial features. In exact arithmetic, CG converges in at most $D$ steps but typically reaches a solution in far fewer iterations.
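The following is a generic Fletcher–Reeves style conjugate gradient for the polynomial least-squares fit, written to mirror the update scheme above; it is a sketch under stated assumptions, not the paper's PolyCG implementation, and `Phi` denotes an assumed matrix of expanded polynomial features.

```python
import numpy as np

def fr_cg_least_squares(Phi, y, max_iter=500, tol=1e-8):
    """Fletcher-Reeves conjugate gradient for min_w 0.5 * ||Phi w - y||^2.
    A generic sketch of the update scheme, not the paper's PolyCG code."""
    w = np.zeros(Phi.shape[1])
    g = Phi.T @ (Phi @ w - y)          # gradient of the quadratic loss
    d = -g
    loss = 0.5 * np.sum((Phi @ w - y) ** 2)
    for _ in range(max_iter):
        Phid = Phi @ d
        alpha = (g @ g) / (Phid @ Phid + 1e-32)   # exact line search step
        w = w + alpha * d
        g_new = Phi.T @ (Phi @ w - y)
        beta = (g_new @ g_new) / (g @ g + 1e-32)  # Fletcher-Reeves coefficient
        d = -g_new + beta * d
        g = g_new
        new_loss = 0.5 * np.sum((Phi @ w - y) ** 2)
        if abs(loss - new_loss) <= tol * max(loss, 1.0):  # relative loss change
            break
        loss = new_loss
    return w

# Usage on synthetic data: 1000 samples, 50 polynomial features.
rng = np.random.default_rng(2)
Phi = rng.standard_normal((1000, 50))
y = Phi @ rng.standard_normal(50)
w_hat = fr_cg_least_squares(Phi, y)
print(np.linalg.norm(Phi @ w_hat - y))   # near zero on this noiseless example
```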
4. Feature Construction and Downstream Integration
Feature construction proceeds by:
- Computing the embedding matrix $V_k$ from the top-$k$ right singular vectors of $X$.
- Mapping each original sample $x_i$ to its compressed representation $z_i = V_k^{\top} x_i$.
- Expanding $z_i$ into polynomial features up to degree $p$.
- Fitting the polynomial weights via PolyCG to optimize prediction on $(z_i, y_i)$ pairs.
The embedding is learned with no label information; only the subsequent polynomial fitting employs supervised learning. This single-pass embedding ensures seamless integration with downstream supervised models and reduces the risk of an information bottleneck arising from two-stage processes.
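A compact sketch of the four steps, assuming scikit-learn's `PolynomialFeatures` for the degree-$p$ expansion and an ordinary least-squares solve standing in for PolyCG; function and variable names (`fit_hops_like`, `V_k`, `Phi`) are illustrative, not from the paper.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def fit_hops_like(X_train, y_train, k=8, degree=3):
    """Self-supervised embedding followed by supervised polynomial fitting.
    Simplified sketch: the real pipeline uses PolyCG rather than the
    closed-form least-squares solve below."""
    # Step 1: label-free embedding from the top-k right singular vectors.
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    V_k = Vt[:k].T
    # Step 2: compress each sample to k dimensions.
    Z = X_train @ V_k
    # Step 3: expand into polynomial features up to the given degree.
    poly = PolynomialFeatures(degree=degree, include_bias=True)
    Phi = poly.fit_transform(Z)
    # Step 4: supervised fit on (z_i, y_i) pairs.
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    return V_k, poly, w

def predict(V_k, poly, w, X_new):
    return poly.transform(X_new @ V_k) @ w
```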
5. Theoretical Properties and Guarantees
The rank-$k$ SVD embedding provides the best possible low-rank approximation in Frobenius norm, as established by the Eckart–Young theorem. Principal information preservation is guaranteed for directions of maximal variance across the training set. The reduction in dimension from $d$ to $k$ translates into parameter savings for polynomial models, from $O(d^{p})$ to $O(k^{p})$, directly addressing overfitting and resource constraints. In addition, the conjugate gradient solver for least squares exhibits theoretical guarantees of convergence in a number of steps not exceeding the ambient parameter dimensionality, with robust empirical behavior under practical numerical conditions (Song et al., 18 Jan 2025).
6. Hyperparameter Selection and Practical Guidelines
The main hyperparameters are:
- Polynomial order $p$: empirical evidence suggests a low order suffices for applications such as load forecasting.
- Embedding ranks: one rank per polynomial order, with descending ranks for higher polynomial orders; the individual ranks can be set via grid search or fixed at modest default values (a minimal grid-search sketch follows this list).
- Optional regularization on the embedding transformation or penalties on the polynomial coefficients can be tuned via cross-validation for the bias–variance tradeoff.
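A minimal cross-validated grid search over the embedding rank and polynomial degree, under the same simplifying assumptions as the pipeline sketch above (closed-form least squares in place of PolyCG); the grid values and helper names (`cv_score`, `grid_search`) are illustrative.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures

def cv_score(X, y, k, degree, n_splits=5):
    """Mean validation MSE for one (k, degree) pair; the embedding is
    re-fit on each training fold to avoid leakage."""
    mse = []
    for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        _, _, Vt = np.linalg.svd(X[tr], full_matrices=False)
        V_k = Vt[:k].T
        poly = PolynomialFeatures(degree=degree)
        Phi_tr = poly.fit_transform(X[tr] @ V_k)
        w, *_ = np.linalg.lstsq(Phi_tr, y[tr], rcond=None)
        pred = poly.transform(X[va] @ V_k) @ w
        mse.append(np.mean((pred - y[va]) ** 2))
    return float(np.mean(mse))

def grid_search(X, y, ks=(4, 8, 16), degrees=(2, 3)):
    """Return the (k, degree) pair with the lowest cross-validated MSE."""
    scores = {(k, p): cv_score(X, y, k, p) for k, p in product(ks, degrees)}
    return min(scores, key=scores.get), scores
```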
7. Comparison to PCA and Autoencoder Methods
If $X$ is z-score normalized, the top-$k$ SVD embedding coincides with principal components analysis (PCA); HOPS generalizes by allowing SVD on unnormalized or min-max normalized inputs. Unlike autoencoders, the embedding admits a closed-form solution with no need for iterative network training or risk of vanishing gradients. The self-supervised nature of the embedding prevents two-stage learning mismatch and, compared with generic nonlinear manifold methods, the linear SVD projection is computationally efficient, numerically stable, and theoretically optimal for reconstructive error.
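A quick numerical check of the PCA correspondence, assuming z-score normalization via scikit-learn's `StandardScaler`: on centered, unit-variance data the top-$k$ right singular vectors match the PCA components up to per-component sign flips. The data here are synthetic and the sizes illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 10)) @ rng.standard_normal((10, 10))  # correlated features
Xz = StandardScaler().fit_transform(X)      # z-score normalization
k = 3

_, _, Vt = np.linalg.svd(Xz, full_matrices=False)
V_k = Vt[:k].T                              # SVD embedding directions

pca = PCA(n_components=k).fit(Xz)
# Directions can differ by sign, so compare up to per-component sign flips.
print(np.allclose(np.abs(pca.components_), np.abs(V_k.T), atol=1e-6))
```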
In summary, self-supervised dimension reduction as realized in the HOPS framework provides a principled, efficient approach for embedding high-dimensional data. Its integration with polynomial models enables scalable, robust learning for complex forecasting tasks, with strong theoretical underpinnings and practical performance advantages (Song et al., 18 Jan 2025).