Multi-source domain adaptation for regression (2312.05460v1)
Abstract: Multi-source domain adaptation (DA) aims to leverage information from more than one source domain to make predictions in a target domain, where the domains may have different data distributions. Most existing methods for multi-source DA focus on classification problems, and the regression setting has received only limited investigation. In this paper, we fill this gap through a two-step procedure. First, we extend a flexible single-source DA algorithm for classification through outcome-coarsening so that it can be applied to regression problems. We then augment our single-source DA algorithm for regression with ensemble learning to achieve multi-source DA. The ensemble algorithm linearly combines the target-adapted learners trained with each source domain, and we consider three learning paradigms for it: (i) a multi-source stacking algorithm to obtain the ensemble weights; (ii) a similarity-based weighting in which the weights reflect the quality of DA of each target-adapted learner; and (iii) a combination of the stacking and similarity weights. We illustrate the performance of our algorithms with simulations and a data application in which the goal is to predict high-density lipoprotein (HDL) cholesterol levels from the gut microbiome. In all these scenarios, we observe a consistent improvement in the prediction performance of our multi-source DA algorithm over routinely used methods.
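The ensemble step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names are hypothetical, outcome-coarsening is taken to mean quantile binning of the continuous outcome, stacking is taken to be plain least squares of the outcome on the learners' predictions, the similarity scores are supplied by the caller, and the combined weights are a simple convex blend controlled by a hypothetical parameter `alpha`. The abstract does not specify any of these details.

```python
import numpy as np


def coarsen_outcome(y, n_bins=4):
    """Discretize a continuous outcome into quantile bins.

    Assumption: 'outcome-coarsening' is sketched here as quantile binning.
    """
    y = np.asarray(y, dtype=float)
    # n_bins - 1 interior cut points produce n_bins classes.
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    labels = np.digitize(y, edges)
    return labels, edges


def stacking_weights(preds, y):
    """Least-squares weights from regressing y on learner predictions.

    Assumption: unconstrained least squares; a constrained variant
    (for example non-negative weights) may be what the paper uses.
    preds has shape (n_samples, n_learners).
    """
    preds = np.asarray(preds, dtype=float)
    w, *_ = np.linalg.lstsq(preds, np.asarray(y, dtype=float), rcond=None)
    return w


def similarity_weights(scores):
    """Normalize non-negative per-learner similarity scores to sum to 1."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()


def combined_weights(w_stack, w_sim, alpha=0.5):
    """Hypothetical combination rule: convex blend of the two weight vectors."""
    return alpha * np.asarray(w_stack) + (1.0 - alpha) * np.asarray(w_sim)


def ensemble_predict(preds, weights):
    """Linear combination of the target-adapted learners' predictions."""
    return np.asarray(preds, dtype=float) @ np.asarray(weights, dtype=float)


# Toy usage: two learners, three labeled samples (made-up numbers).
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 3.0, 5.0])
w_stack = stacking_weights(P, y)
w_sim = similarity_weights([1.0, 3.0])
y_hat = ensemble_predict(P, combined_weights(w_stack, w_sim))
```

All three paradigms share the same prediction rule, a weighted sum of the per-source learners, and differ only in how the weight vector is obtained, which is why the sketch isolates the weighting functions from `ensemble_predict`.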