Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ (2401.13185v3)

Published 24 Jan 2024 in cs.LG and cs.DS

Abstract: We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). Our algorithms support all combinations of column-wise centering and scaling of $\mathbf{X}$ and $\mathbf{Y}$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time complexity as that of computing $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ and space complexity equivalent to storing $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{X}^\mathbf{T}\mathbf{X}$, and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Importantly, unlike alternatives found in the literature, we avoid data leakage due to preprocessing. We achieve these results by eliminating redundant computations in the overlap between training partitions. Concretely, we show how to manipulate $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ using only samples from the validation partition to obtain the preprocessed training partition-wise $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. To our knowledge, we are the first to derive correct and efficient cross-validation algorithms for any of the $16$ combinations of column-wise centering and scaling, for which we also prove only $12$ give distinct matrix products.
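The core idea of eliminating redundant work in the overlap between training partitions can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it covers only the simplest case with no centering or scaling (the paper's contribution is handling all preprocessing combinations without data leakage), and all variable names are illustrative. The full-data products are computed once, and each fold's training-partition products are obtained by subtracting only the validation samples' contribution:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 100, 8, 3               # samples, X features, Y targets
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))

# Computed once in O(N K^2) / O(N K M), independent of the number of folds.
XtX_full = X.T @ X
XtY_full = X.T @ Y

folds = np.arange(N) % 5          # simple 5-fold partition
for f in range(5):
    val = folds == f
    Xv, Yv = X[val], Y[val]
    # Downdate using only validation samples: O(|val| K^2) per fold,
    # instead of recomputing over the much larger training partition.
    XtX_train = XtX_full - Xv.T @ Xv
    XtY_train = XtY_full - Xv.T @ Yv
    # Sanity check against the naive training-partition computation.
    assert np.allclose(XtX_train, X[~val].T @ X[~val])
    assert np.allclose(XtY_train, X[~val].T @ Y[~val])
```

Summed over all folds, the downdates touch each sample exactly once, which is why the total cost matches that of computing $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ a single time; the centered and scaled variants in the paper require additional corrections derived from validation-partition statistics.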

