
Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets (2312.12786v2)

Published 20 Dec 2023 in stat.ME, stat.AP, and stat.ML

Abstract: Development of comprehensive prediction models is often of great interest in many scientific disciplines, but datasets with information on all desired features often have small sample sizes. We describe a transfer learning approach for building high-dimensional generalized linear models using data from a main study with detailed information on all predictors and an external, potentially much larger, study that has ascertained a more limited set of predictors. We propose using the external dataset to build a reduced model and then "transfer" the information on underlying parameters for the analysis of the main study through a set of calibration equations which can account for the study-specific effects of design variables. We then propose a penalized generalized method of moments framework for inference and a one-step estimation method that can be implemented using the standard glmnet package. We develop asymptotic theory and conduct extensive simulation studies to investigate both predictive performance and post-selection inference properties of the proposed method. Finally, we illustrate an application of the proposed method for the development of risk models for five common diseases using the UK Biobank study, combining information on low-dimensional risk factors and high-throughput proteomic biomarkers.
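
To make the idea of a calibration equation concrete, here is a minimal Python sketch. It is not the paper's estimator: it assumes an identity link (ordinary linear regression), assumes the main and external studies share the same predictor distribution, and uses one common form of calibration moment from the data-integration literature, namely the average of X_A (Xβ − X_A θ), where X_A is the subset of predictors measured in the external study and θ is the reduced-model coefficient vector. The function names, sample sizes, and coefficient values are all hypothetical, and the penalized GMM and one-step glmnet steps are not implemented.

```python
import numpy as np


def calibration_moment(X_main, beta, theta, reduced_idx):
    """Sample average of X_A * (X beta - X_A theta) under an identity link.

    Assumed form of the calibration equation for illustration only;
    it should be near zero when beta is the true full-model coefficient
    vector and theta is the reduced-model fit from the external study.
    """
    X_A = X_main[:, reduced_idx]
    diff = X_main @ beta - X_A @ theta
    return X_A.T @ diff / X_main.shape[0]


def simulate(n, beta, rng):
    """Draw correlated predictors and a linear outcome (hypothetical design)."""
    p = beta.shape[0]
    shared = rng.normal(size=(n, 1))
    # The shared factor makes omitted predictors correlate with X_A,
    # so the reduced-model coefficients differ from beta[reduced_idx].
    X = 0.6 * shared + rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    return X, y


rng = np.random.default_rng(0)
beta_true = np.array([1.0, -0.5, 0.8, 0.0, 0.3])
reduced_idx = [0, 1]  # predictors also measured in the external study

# Large external study: fit the reduced model on X_A only.
X_ext, y_ext = simulate(50_000, beta_true, rng)
theta_hat, *_ = np.linalg.lstsq(X_ext[:, reduced_idx], y_ext, rcond=None)

# Small main study: evaluate the calibration moment at two candidate betas.
X_main, _ = simulate(2_000, beta_true, rng)
m_true = calibration_moment(X_main, beta_true, theta_hat, reduced_idx)
m_zero = calibration_moment(X_main, np.zeros(5), theta_hat, reduced_idx)

print("moment at true beta:", m_true)
print("moment at beta = 0: ", m_zero)
```

In this toy setting the moment is close to zero at the true coefficients and clearly nonzero at a wrong candidate, which is the sense in which the external reduced model constrains the full-model parameters. A GMM-type procedure would combine such moments with the main-study score equations; how the paper weights and penalizes them is not reproduced here.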
