
High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations

Published 17 Dec 2025 in stat.ML and cs.LG | (2512.15684v1)

Abstract: Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this paper, we study a data integration model in which two high-dimensional data matrices share a low-rank common latent structure while also containing individual-specific components. We analyze the singular vectors of the associated cross-covariance matrix using tools from random matrix theory and derive asymptotic characterizations of the alignment between estimated and true latent directions. These results provide a quantitative explanation of the reconstruction performance of the PLS variant based on Singular Value Decomposition (PLS-SVD) and identify regimes where the method exhibits counter-intuitive or limiting behavior. Building on this analysis, we compare PLS-SVD with principal component analysis applied separately to each dataset and show its asymptotic superiority in detecting the common latent subspace. Overall, our results offer a comprehensive theoretical understanding of high-dimensional PLS-SVD, clarifying both its advantages and fundamental limitations.

Summary

  • The paper presents a rigorous spectral theory for PLS-SVD in high dimensions, quantifying phase transition thresholds for detecting joint versus individual signals.
  • It employs spiked random matrix theory to derive deterministic equivalents for the resolvent and to explain systematic eigenvector misalignment under noise.
  • The analysis demonstrates that PLS-SVD outperforms separate PCA by reliably recovering shared signal structures even in challenging noise regimes.

Overview and Motivation

Partial Least Squares (PLS) provides a foundational approach for extracting shared latent structures across paired high-dimensional datasets, with critical utility for data integration in genomics, chemometrics, and other domains. This paper investigates PLS, and specifically the singular value decomposition variant (PLS-SVD), under a high-dimensional regime where both the predictor and response matrices possess large ambient dimension and limited sample size. While PLS remains heavily used in practice, especially in settings with small sample size $n$, a theoretical characterization of its spectral properties and limitations in modern high-dimensional contexts has been lacking. The authors present a rigorous spectral analysis using spiked random matrix theory, focusing on the alignment of singular vectors and the detectability of signal components, with particular attention to distinguishing shared (joint) signals, individual-specific components, and noise.

Model and Mathematical Framework

The paper formalizes a signal-plus-noise model for two data matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, each decomposed into a joint signal, an individual-specific signal, and Gaussian noise:

$$X = JC_P + I_M + E, \qquad Y = JC_R + I_N + F$$

where $J$ carries the common latent scores, $C_P$ and $C_R$ encode the loadings for predictors and responses, $I_M$ and $I_N$ are individual low-rank structures specific to each modality, and $E$, $F$ are noise matrices with i.i.d. entries. Orthogonality conditions ensure identifiability and prevent overlap between signal components. The analysis operates in a high-dimensional asymptotic regime ($n, p, q \rightarrow \infty$ with fixed ratios) and employs random matrix theoretic tools, specifically deterministic equivalents of kernel resolvents and advanced spectral perturbation methods, to examine the empirical cross-covariance matrix $S = \frac{1}{\sqrt{pq}} X^\top Y$.
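The following is a minimal simulation sketch of this model, not the authors' code: it assumes rank-1 joint and individual components, standard Gaussian noise, and arbitrary illustrative choices for the dimensions, the signal strength `theta`, and the normalizations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 1000, 800  # sample size and ambient dimensions (illustrative)

def unit_row(dim):
    """Random unit-norm loading row (an illustrative normalization choice)."""
    v = rng.standard_normal((1, dim))
    return v / np.linalg.norm(v)

# Rank-1 joint component: shared latent scores J with modality-specific loadings.
theta = 2.0                      # illustrative joint signal strength
J = rng.standard_normal((n, 1))  # common scores shared by X and Y
C_P, C_R = unit_row(p), unit_row(q)

# Rank-1 individual components, drawn independently of J and of each other.
I_M = rng.standard_normal((n, 1)) @ unit_row(p)
I_N = rng.standard_normal((n, 1)) @ unit_row(q)

# Gaussian noise matrices with i.i.d. standard entries.
E = rng.standard_normal((n, p))
F = rng.standard_normal((n, q))

X = theta * J @ C_P + I_M + E
Y = theta * J @ C_R + I_N + F

# Empirical cross-covariance matrix analyzed by PLS-SVD.
S = X.T @ Y / np.sqrt(p * q)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
print("top squared singular values:", s[:5] ** 2)
```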

Limiting Spectral Distributions and Bulk Behavior

The first main result establishes a deterministic equivalent for the resolvent of $S$, permitting explicit calculation of the limiting spectral distribution of the squared singular values (Figure 1). The Stieltjes transform associated with this bulk law is shown to satisfy a cubic polynomial equation, generalizing earlier Marčenko–Pastur and BBP-type results for spiked covariance models to the setting of paired datasets. Importantly, the empirical spectrum is tightly confined to the predicted support under pure noise, providing a rigorous basis for the subsequent detection of isolated spikes corresponding to signal components.

Figure 1: Empirical distribution of the squared singular values of $S$ together with the limiting spectral distribution predicted by the proposed random matrix theory.
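As a rough empirical companion to Figure 1, one can histogram the squared singular values of a noise-only $S$. The sketch below assumes standard Gaussian noise and arbitrary dimensions; the limiting density itself requires solving the cubic equation for the Stieltjes transform and is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n, p, q = 500, 1000, 800

# Pure-noise cross-covariance: both matrices contain only i.i.d. Gaussian noise.
E = rng.standard_normal((n, p))
F = rng.standard_normal((n, q))
S = E.T @ F / np.sqrt(p * q)

# Histogram of the squared singular values approximates the bulk law.
sq = np.linalg.svd(S, compute_uv=False) ** 2
plt.hist(sq, bins=60, density=True)
plt.xlabel("squared singular values of S")
plt.title("Empirical bulk under pure noise")
plt.show()
```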

Phase Transitions and Spike Detection

A central contribution is the explicit characterization of phase transition thresholds: for both individual and joint components, isolated singular values ('spikes') emerge from the bulk spectrum only when the signal strength exceeds a critical value $\tau$, determined as the largest positive root of a cubic polynomial in the signal-to-noise parameters. Analytical expressions for the asymptotic spike locations are derived, elucidating their bias and their dependence on the signal regime. The presence of individual-specific components is shown to induce isolated singular values, but these are systematically non-informative: they do not align with any deterministic signal direction beyond their generator (Figure 2 and Figure 3).

Figure 2: Empirical distribution of the squared singular values of $S$, including the spikes generated by individual-specific components, alongside the limiting spike locations.

Figure 3: Empirical means of the top singular vectors generated by individual-specific components, which do not align with any deterministic signal direction.
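The supercritical behavior can be illustrated numerically by sweeping the joint signal strength and watching the top singular value of $S$ detach from the bulk, as in the sketch below. It uses a noise-only run as a crude proxy for the bulk edge rather than the analytical threshold $\tau$ from the paper, and all scalings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 500, 1000, 800

def top_sq_sv(theta):
    """Top squared singular value of S for a rank-1 joint signal of strength theta."""
    J = rng.standard_normal((n, 1))
    cp = rng.standard_normal(p); cp /= np.linalg.norm(cp)
    cr = rng.standard_normal(q); cr /= np.linalg.norm(cr)
    X = theta * J @ cp[None, :] + rng.standard_normal((n, p))
    Y = theta * J @ cr[None, :] + rng.standard_normal((n, q))
    S = X.T @ Y / np.sqrt(p * q)
    return np.linalg.svd(S, compute_uv=False)[0] ** 2

edge = top_sq_sv(0.0)  # crude proxy for the right edge of the noise bulk
for theta in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(f"theta={theta}: top spike {top_sq_sv(theta):.3f} vs noise edge ~{edge:.3f}")
```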

Eigenvector Alignment and Limitations

The eigenvector analysis reveals nuanced and technically important misalignments. The singular vectors associated with the joint (shared) signal components do not asymptotically recover the true latent directions; rather, they align with 'skewed' versions of them, systematically distorted by noise even in the infinite-sample limit, except for particular spectral configurations (Figure 4 and Figure 5). This skewing prevents direct signal recovery and exposes a persistent limitation: PLS-SVD does not yield optimal reconstruction of the joint latent structure except in degenerate cases.

Figure 4: Empirical distribution of the squared singular values of $S$ together with the theoretical spike locations due to common components.

Figure 5: Comparison of the top singular vectors of $S$ and the true signal matrix, demonstrating skewed alignment due to noise.
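A hedged numerical illustration of this skew: compare the top singular vectors of $S$ with the true unit-norm loadings under the same toy rank-1 model as above. The printed alignments are absolute cosines, which, per the theory, should remain strictly below 1 as the dimensions grow.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 500, 1000, 800
theta = 3.0  # chosen well above the detection threshold for illustration

J = rng.standard_normal((n, 1))
cp = rng.standard_normal(p); cp /= np.linalg.norm(cp)  # true X-side loading
cr = rng.standard_normal(q); cr /= np.linalg.norm(cr)  # true Y-side loading
X = theta * J @ cp[None, :] + rng.standard_normal((n, p))
Y = theta * J @ cr[None, :] + rng.standard_normal((n, q))

S = X.T @ Y / np.sqrt(p * q)
U, s, Vt = np.linalg.svd(S, full_matrices=False)

# |cosine| between estimated and true directions; the theory predicts these
# stay bounded away from 1 in the high-dimensional limit (the "skew").
print("alignment on X side:", abs(cp @ U[:, 0]))
print("alignment on Y side:", abs(cr @ Vt[0, :]))
```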

When both joint and individual spikes coexist, the theory predicts, and empirical results confirm, that the corresponding singular vectors remain orthogonal in the limit (Figure 6). However, the individual spikes remain spurious and degrade interpretability.

Figure 6: Spike locations and alignments for both specific and common components, empirically validating asymptotic orthogonality in large dimensions.

PLS versus PCA in Shared Signal Recovery

Importantly, under the same asymptotic regime, the authors demonstrate that PLS-SVD is strictly superior to principal component analysis (PCA) applied separately to each dataset for detecting shared components. Whenever PCA can detect shared spikes, PLS necessarily detects (and better separates) them, and PLS can succeed in parameter regimes where PCA fails. Vector alignments and phase transition thresholds are analytically compared and visualized (Figure 7), showing that PLS enables signal recovery in unbalanced or weak-signal settings inaccessible to componentwise PCA.

Figure 7: Colormaps of vector alignment illustrating the lower detection threshold and improved joint recovery for PLS compared to PCA in the rank-1 case.
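The comparison can be sketched numerically by estimating the shared loading direction with both methods under the same toy model. The value of `theta` below is an arbitrary choice; which method succeeds, and by how much, is governed by the respective thresholds derived in the paper, so this sketch is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 500, 1000, 800
theta = 1.8  # illustrative signal strength (assumed near the detection regime)

J = rng.standard_normal((n, 1))
cp = rng.standard_normal(p); cp /= np.linalg.norm(cp)
cr = rng.standard_normal(q); cr /= np.linalg.norm(cr)
X = theta * J @ cp[None, :] + rng.standard_normal((n, p))
Y = theta * J @ cr[None, :] + rng.standard_normal((n, q))

# PLS-SVD direction: top left singular vector of the cross-covariance matrix.
S = X.T @ Y / np.sqrt(p * q)
u_pls = np.linalg.svd(S, full_matrices=False)[0][:, 0]

# Separate-PCA direction: top eigenvector of the sample covariance of X alone.
evals, evecs = np.linalg.eigh(X.T @ X / n)
u_pca = evecs[:, -1]  # eigenvector of the largest eigenvalue

print("PLS alignment with true loading:", abs(cp @ u_pls))
print("PCA alignment with true loading:", abs(cp @ u_pca))
```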

Practical and Theoretical Implications

These results have direct implications for multi-modal integrative analysis. The explicit spectral characterizations provide a basis for developing new post-processing and filtering methods that eliminate spurious singular directions (e.g., those induced by individual-specific components) and correct for the skewing of joint singular vectors. The limitations elucidated here suggest both the necessity and the feasibility of improved PLS algorithms, for example by incorporating structured regularization or iterative filtering analogous to O2PLS [trygg2003o2pls]. The deterministic equivalent framework introduced here is extendable to other PLS variants and may be adapted to regression settings and broader families of matrix integration techniques.

Conclusion

This work provides an authoritative spectral theory for PLS-SVD in high-dimensional regimes, rigorously identifying both its principal advantages and its fundamental limitations compared to componentwise PCA. The explicit bulk laws, spike detection thresholds, and alignment formulas extend the mathematical theory of data integration and random matrices, with substantial consequences for the design and understanding of modern statistical learning algorithms. Future directions include adaptation to real-world noise structures, more complex latent designs, and refined post hoc filtering methods for improved interpretability and prediction.
