A Stability Framework for Parameter Selection in the Minimum Covariance Determinant Problem
Abstract: The Minimum Covariance Determinant (MCD) method is a widely adopted tool for robust estimation and outlier detection. In this paper, we introduce MCD model selection based on the notion of stability. Our best subset method leverages prior best practices such as statistical depths for initialization and concentration steps for subset refinement. Our contribution lies in constructing a bootstrap procedure to estimate the instability of the best subset algorithm. The instability path offers insights into a dataset's inlier/outlier structure and facilitates suitable choice of the subset size. We rigorously benchmark the proposed framework against existing MCD variants and illustrate its practical utility on several real-world datasets.
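The ingredients named in the abstract (an initial subset, concentration steps, and a bootstrap estimate of instability) can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not define the depth function or the instability measure, so the code below assumes a MAD-scaled distance to the coordinate-wise median as a stand-in for a statistical depth, and assumes instability is the average disagreement between the inlier labels produced by two fits on independent bootstrap resamples. All function names are hypothetical, and the data are assumed to have at least two columns.

```python
import numpy as np


def mahalanobis_sq(X, mu, cov):
    """Squared Mahalanobis distance of each row of X to (mu, cov)."""
    diff = X - mu
    return np.einsum("ij,ij->i", diff @ np.linalg.pinv(cov), diff)


def initial_subset(X, h):
    """Assumed stand-in for a depth-based start: the h points closest
    to the coordinate-wise median, with each coordinate scaled by its MAD."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) + 1e-12
    score = (((X - med) / mad) ** 2).sum(axis=1)
    return np.sort(np.argsort(score)[:h])


def c_steps(X, subset, h, max_iter=100):
    """Concentration steps: refit on the current subset, keep the h points
    with the smallest distances, and stop when the subset no longer changes."""
    for _ in range(max_iter):
        mu = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        new_subset = np.sort(np.argsort(mahalanobis_sq(X, mu, cov))[:h])
        if np.array_equal(new_subset, subset):
            break
        subset = new_subset
    # Recompute the estimates so they always match the returned subset.
    mu = X[subset].mean(axis=0)
    cov = np.cov(X[subset], rowvar=False)
    return mu, cov, subset


def fit_subset(X, h):
    """Initialization followed by concentration steps."""
    return c_steps(X, initial_subset(X, h), h)


def instability(X, h, n_boot=20, seed=0):
    """Assumed instability measure: mean fraction of original points whose
    inlier/outlier label differs between fits on two bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    disagreements = []
    for _ in range(n_boot):
        labels = []
        for _ in range(2):
            Xb = X[rng.integers(0, n, size=n)]  # resample rows with replacement
            mu, cov, _ = fit_subset(Xb, h)
            inlier = np.zeros(n, dtype=bool)
            inlier[np.argsort(mahalanobis_sq(X, mu, cov))[:h]] = True
            labels.append(inlier)
        disagreements.append(np.mean(labels[0] != labels[1]))
    return float(np.mean(disagreements))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(size=(80, 2)), 8 + 0.5 * rng.normal(size=(20, 2))])
    # Instability path over a grid of candidate subset sizes.
    for h in (60, 70, 80, 90):
        print(h, instability(X, h, n_boot=10))
```

Evaluating `instability` over a grid of subset sizes gives a path of the kind the abstract describes; under these assumptions, a subset size at which the path stays low is a natural candidate.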