
U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks (2407.15301v1)

Published 22 Jul 2024 in stat.ML, cs.LG, math.ST, q-bio.QM, and stat.TH

Abstract: Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.
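The procedure described in the abstract can be sketched in a few lines of numpy. This is a rough illustration under stated assumptions, not the paper's implementation: ordinary least squares stands in for the Lasso/DNN base learners, random subsamples approximate the combinatory multi-subsampling scheme (an incomplete U-statistic), and the variance uses the classical first-order Hájek-projection approximation Var(U) ≈ m²ζ₁/n, without the paper's finite-sample corrections. All variable names and constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear signal plus noise. Least squares stands in for
# the paper's Lasso / DNN base learners.
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = np.array([1.5, -2.0, 0.0, 0.0, 0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)
x0 = rng.normal(size=p)          # point at which we want a prediction CI

B, m = 2000, 60                  # number of subsamples, subsample size

inclusion = np.zeros((B, n))     # which observations each subsample used
preds = np.empty(B)
for b in range(B):
    idx = rng.choice(n, size=m, replace=False)   # one random subsample
    inclusion[b, idx] = 1.0
    coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds[b] = x0 @ coef

point = preds.mean()             # ensemble (incomplete U-statistic) prediction

# Hajek-projection heuristic: zeta_1 is the variance of the conditional
# expectation of the kernel given one observation; for a kernel of
# order m, Var(U) is approximately m^2 * zeta_1 / n.
cond_means = np.array([preds[inclusion[:, i] == 1].mean() for i in range(n)])
zeta1_hat = np.mean((cond_means - point) ** 2)
se = np.sqrt(m**2 * zeta1_hat / n)

lo, hi = point - 1.96 * se, point + 1.96 * se
print(f"prediction {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

In practice the base-learner fit inside the loop would be a Lasso or DNN trained on the subsample, and B and m would be chosen with the trade-off between computation and the quality of the incomplete-U-statistic approximation in mind.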

