
Grouping predictors via network-wide metrics (2405.02715v1)

Published 4 May 2024 in stat.ME, math.ST, and stat.TH

Abstract: When multitudes of features can plausibly be associated with a response, both privacy considerations and model parsimony suggest grouping them to increase the predictive power of a regression model. Specifically, the identification of groups of predictors significantly associated with the response variable eases further downstream analysis and decision-making. This paper proposes a new data analysis methodology that utilizes the high-dimensional predictor space to construct an implicit network with weighted edges to identify significant associations between the response and the predictors. Using a population model for groups of predictors defined via network-wide metrics, a new supervised grouping algorithm is proposed to determine the correct group, with probability tending to one as the sample size diverges to infinity. For this reason, we establish several theoretical properties of the estimates of network-wide metrics. A novel model-assisted bootstrap procedure that substantially decreases computational complexity is developed, facilitating the assessment of uncertainty in the estimates of network-wide metrics. The proposed methods account for several challenges that arise in the high-dimensional data setting, including (i) a large number of predictors, (ii) uncertainty regarding the true statistical model, and (iii) model selection variability. The performance of the proposed methods is demonstrated through numerical experiments, data from sports analytics, and breast cancer data.
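
To make the idea of a network-wide metric on a weighted network concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the edge weights, the metric, or the bootstrap details, so everything below is an assumption made for illustration. The sketch uses absolute sample correlations as edge weights, node strength (weighted degree) as the metric, and a plain nonparametric bootstrap in place of the paper's model-assisted bootstrap. The function names `node_strength`, `bootstrap_se`, and `simulate` are hypothetical.

```python
import numpy as np


def node_strength(data):
    """Weighted degree of each variable in an absolute-correlation network.

    Assumption: edge weight = |sample correlation|. The paper may define
    its weights and metrics differently.
    """
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    weights = np.abs(corr)
    np.fill_diagonal(weights, 0.0)          # drop self-loops
    return weights.sum(axis=1)


def bootstrap_se(data, n_boot=200, seed=0):
    """Standard errors of node strength from a plain row-resampling bootstrap.

    This is the ordinary nonparametric bootstrap, used here as a stand-in;
    it is not the model-assisted procedure described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    draws = np.empty((n_boot, data.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)    # sample rows with replacement
        draws[b] = node_strength(data[idx])
    return draws.std(axis=0, ddof=1)


def simulate(n=300, seed=1):
    """Toy data: 3 predictors tied to the response, 3 pure-noise predictors."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=n)
    related = latent[:, None] + 0.5 * rng.normal(size=(n, 3))
    noise = rng.normal(size=(n, 3))
    response = latent + 0.5 * rng.normal(size=n)
    # Column order: related predictors, noise predictors, response.
    return np.column_stack([related, noise, response])


if __name__ == "__main__":
    data = simulate()
    print("node strength:", np.round(node_strength(data), 2))
    print("bootstrap SE: ", np.round(bootstrap_se(data), 3))
```

In this toy setup the three predictors that share a latent factor with the response end up with much larger node strength than the three noise predictors, which is the kind of separation a grouping rule based on such a metric would rely on.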

