Statistical Inference for High-Dimensional Linear Regression with Blockwise Missing Data (2106.03344v2)

Published 7 Jun 2021 in stat.ME, math.ST, stat.ML, and stat.TH

Abstract: Blockwise missing data occur frequently when we integrate multi-source or multi-modality data in which different sources or modalities contain complementary information. In this paper, we consider a high-dimensional linear regression model with blockwise missing covariates and a partially observed response variable. Under this framework, we propose a computationally efficient estimator for the regression coefficient vector based on carefully constructed unbiased estimating equations and a blockwise imputation procedure, and obtain its rate of convergence. Furthermore, building upon an innovative projected estimating equation technique that intrinsically achieves bias correction of the initial estimator, we propose a nearly unbiased estimator for each individual regression coefficient, which is asymptotically normally distributed under mild conditions. Based on these debiased estimators, asymptotically valid confidence intervals and statistical tests for each regression coefficient are constructed. Numerical studies and an application to the Alzheimer's Disease Neuroimaging Initiative data show that the proposed method performs better, and benefits more from unsupervised samples, than existing methods.
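
The abstract describes a three-step workflow: blockwise imputation of the missing covariate blocks, an initial regularized estimator built from estimating equations, and a coefficient-wise debiasing step that yields asymptotically normal estimates and confidence intervals. The sketch below, assuming numpy and scikit-learn, illustrates that general workflow on simulated data with one always-observed block and one blockwise-missing block. The debiasing shown is a generic debiased-lasso style correction used only for illustration, not the authors' projected estimating equation procedure, and all tuning parameters and helper names (e.g. `imputer`) are hypothetical choices.

```python
# Minimal illustrative sketch, assuming numpy and scikit-learn.
# NOT the paper's estimator: it mimics the general workflow only
# (blockwise imputation -> initial regularized fit -> one-step debiasing
# of a single coefficient with a normal confidence interval).
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p_a, p_b = 400, 10, 10            # block A always observed, block B blockwise missing
p = p_a + p_b
beta = np.zeros(p)
beta[:3] = [1.0, -0.8, 0.5]          # sparse true coefficient vector (illustrative)

X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Blockwise missingness: block B is unobserved for a random half of the samples.
obs_b = rng.random(n) < 0.5
X_obs = X.copy()
X_obs[~obs_b, p_a:] = np.nan

# Blockwise imputation: predict the missing block from the fully observed block,
# fitted on the samples where both blocks are available.
imputer = LinearRegression().fit(X_obs[obs_b, :p_a], X_obs[obs_b, p_a:])
X_imp = X_obs.copy()
X_imp[~obs_b, p_a:] = imputer.predict(X_obs[~obs_b, :p_a])

# Initial regularized estimator on the imputed design (tuning chosen ad hoc).
beta_init = Lasso(alpha=0.05).fit(X_imp, y).coef_

# Generic one-step debiasing of coefficient j (debiased-lasso style),
# shown only to illustrate how a coefficient-wise confidence interval is formed;
# the paper instead uses projected estimating equations.
j = 0
others = np.delete(np.arange(p), j)
node = Lasso(alpha=0.05).fit(X_imp[:, others], X_imp[:, j])
z = X_imp[:, j] - node.predict(X_imp[:, others])     # approximate projection direction
resid = y - X_imp @ beta_init
beta_j = beta_init[j] + z @ resid / (z @ X_imp[:, j])

sigma2 = np.sum(resid ** 2) / max(n - np.count_nonzero(beta_init), 1)
se = np.sqrt(sigma2) * np.linalg.norm(z) / abs(z @ X_imp[:, j])
print(f"debiased beta_{j}: {beta_j:.3f}, "
      f"95% CI: [{beta_j - 1.96 * se:.3f}, {beta_j + 1.96 * se:.3f}]")
```

In this toy setting the imputation step uses only the complete cases, whereas the paper's estimating-equation construction also exploits partially observed and unsupervised samples; the sketch is meant solely to make the structure of the workflow concrete.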
