Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank (2404.17626v2)

Published 26 Apr 2024 in cs.LG, stat.AP, stat.CO, and q-bio.QM

Abstract: Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy but for a limited set of diseases and moderate improvements in accuracy.

Summary

The paper shows that pretrained lasso and glinternet models yield statistically significant ROC-AUC improvements over the baseline in 16 of the 96 models trained.
It uses pretraining and interaction modeling to combine data from White British individuals and other ancestries, with the aim of mitigating data imbalance and scarcity.
The study lays foundational work for refining genomic prediction strategies, offering promising avenues for personalized medicine in underrepresented populations.

Utilizing Pre-training and Interaction Modeling for Ancestry-Specific Disease Prediction in Multiomic Data

Overview of Study

This paper evaluates the utility of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso for improving disease prediction across ancestries within the UK Biobank, focusing on 8 diseases. The models are trained on data from White British individuals and several other ancestries (South Asian, African, non-British European, and admixed). Of the 96 models trained, 16 show statistically significant improvements in prediction performance as measured by ROC-AUC. The paper highlights the moderate gains that these statistical techniques can bring to disease risk prediction in ancestrally diverse data.

Contribution to Statistical Methods

Pretraining and Interaction Modeling Techniques

The paper's main methodological contribution is the application of two statistical methods suited to the scarcity and imbalance of data across ancestries:

Pretrained Lasso: Improves disease prediction in a specific ancestry by reusing patterns learned from the combined data, which is dominated by White British individuals.
Glinternet: Fits pairwise interactions under a group-lasso penalty that enforces strong hierarchy, so an interaction term is retained only if both of the variables involved also enter the model as main effects.
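The group-lasso penalty behind glinternet shrinks whole groups of coefficients together. The sketch below shows only that building block, block soft-thresholding, under the assumption (from the general glinternet formulation, not from this summary) that an interaction group bundles the interaction coefficient with copies of its two main effects. The function name and the toy numbers are hypothetical, and this is not the full glinternet fitting procedure.

```python
import math

def group_soft_threshold(group, t):
    """Proximal operator of t * ||group||_2 (block soft-thresholding).

    The whole group is set to zero when its Euclidean norm is at most t;
    otherwise every coefficient is shrunk by the same factor.
    """
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= t:
        return [0.0] * len(group)
    scale = 1.0 - t / norm
    return [scale * v for v in group]

# Hypothetical interaction group: [main effect A, main effect B, A x B].
# A strong group survives as a unit, so the main effects stay in the model
# whenever the interaction does.
strong_group = group_soft_threshold([0.9, 0.7, 0.5], t=0.4)

# A weak group is removed as a unit: interaction and its copies vanish together.
weak_group = group_soft_threshold([0.1, 0.05, 0.2], t=0.4)
```

Because a group is either kept or dropped in one piece, a non-zero interaction always arrives with non-zero main effects, which is the strong-hierarchy behaviour described above.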

These methods are promising in settings where data for one population is limited, which matters here because the UK Biobank is heavily skewed toward White British participants.
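The two-stage idea behind the pretrained lasso can be sketched as follows. This is a minimal illustration, assuming the commonly described formulation: a first lasso on all individuals, then a group-specific lasso that takes a scaled first-stage prediction as an offset and penalizes features outside the first-stage support more heavily. It uses squared-error loss for brevity, whereas disease outcomes would call for a logistic model. The function names, `lam`, `alpha`, and the toy data are all assumptions made for the example, not the paper's settings.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, penalty_factor=None, offset=None, n_iter=100):
    """Minimize (1/2n)*sum((y - offset - b0 - X b)^2) + lam*sum(pf_j*|b_j|)."""
    n, p = len(X), len(X[0])
    pf = [1.0] * p if penalty_factor is None else penalty_factor
    off = [0.0] * n if offset is None else offset
    b0, beta = 0.0, [0.0] * p
    r = [y[i] - off[i] for i in range(n)]  # current residuals
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        shift = sum(r) / n  # unpenalized intercept update
        b0 += shift
        r = [ri - shift for ri in r]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * pf[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                r = [r[i] - delta * X[i][j] for i in range(n)]
                beta[j] = new
    return b0, beta

def pretrained_lasso(X_all, y_all, X_grp, y_grp, lam, alpha):
    """Stage 1 on everyone, stage 2 on one group; alpha in (0, 1] (assumed form)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    b0, beta = lasso_cd(X_all, y_all, lam)
    support = {j for j, b in enumerate(beta) if b != 0.0}
    # Offset: scaled stage-1 prediction for the group's individuals.
    offset = [(1.0 - alpha) * (b0 + sum(b * x for b, x in zip(beta, row)))
              for row in X_grp]
    # Features outside the stage-1 support get a larger penalty.
    pf = [1.0 if j in support else 1.0 / alpha for j in range(len(beta))]
    g0, gamma = lasso_cd(X_grp, y_grp, lam, pf, offset)
    return (b0, beta), (g0, gamma), offset

# Toy data (hypothetical): a large group and a small group sharing most signal.
rng = random.Random(0)
def make_rows(n, p):
    return [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
def make_y(X, coef):
    return [sum(c * x for c, x in zip(coef, row)) + rng.gauss(0, 0.5) for row in X]

X_major = make_rows(200, 5)
y_major = make_y(X_major, [2.0, -1.0, 0.0, 0.0, 0.0])
X_minor = make_rows(30, 5)
y_minor = make_y(X_minor, [2.0, -1.0, 0.0, 0.8, 0.0])
stage1, stage2, _ = pretrained_lasso(X_major + X_minor, y_major + y_minor,
                                     X_minor, y_minor, lam=0.1, alpha=0.5)
```

In this formulation `alpha = 1` removes the offset and the extra penalty, so the second stage reduces to an ordinary lasso on the group alone, while smaller values lean more heavily on the model fitted to the combined data.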

Model Performance

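Of the 96 glinternet and pretrained lasso models trained, 16 showed a statistically significant gain in ROC-AUC over the baseline (p-value < 0.05). The gains were found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis, and in the models that outperformed the baseline the PRS was the primary driver of prediction. These results suggest that borrowing strength across ancestries can improve disease prediction, although only for some diseases and by a moderate margin.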
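To make the comparison concrete, the sketch below computes ROC-AUC from scratch and uses a paired bootstrap to estimate how often a candidate model fails to beat a baseline. This summary does not state which significance test the authors used, so the bootstrap here is an assumed stand-in, and the function names, scores, and seed are hypothetical.

```python
import random

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a case outranks a control (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_auc_gain(y_true, base_scores, new_scores, n_boot=200, seed=0):
    """Observed AUC gain and a rough one-sided bootstrap p-value.

    The p-value is the share of resamples in which the new model shows no gain.
    """
    rng = random.Random(seed)
    n = len(y_true)
    observed = roc_auc(y_true, new_scores) - roc_auc(y_true, base_scores)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y_true[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip resamples that contain a single class
        gain = (roc_auc(yb, [new_scores[i] for i in idx])
                - roc_auc(yb, [base_scores[i] for i in idx]))
        diffs.append(gain)
    p_value = sum(1 for d in diffs if d <= 0.0) / n_boot
    return observed, p_value

# Hypothetical scores: an uninformative baseline against an informative model.
rng = random.Random(1)
y = [i % 2 for i in range(100)]
base = [rng.gauss(0, 1) for _ in y]
new = [yi + rng.gauss(0, 0.3) for yi in y]
gain, p = bootstrap_auc_gain(y, base, new)
```

Resampling the same individuals for both models keeps the comparison paired, which matters because the two sets of scores are evaluated on one validation cohort.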

Theoretical and Practical Implications

Theoretical Insights

The paper offers insight into how GLMs and interaction models can be adapted to a multiomic, multi-ancestry setting. By applying pretraining and interaction terms in this setting, it broadens the range of machine learning approaches used in genetic studies and may inform the methodology of future genomic research.

Practical Implications

The work is relevant to genetic epidemiology because it may help in developing more accurate disease prediction models across ancestral backgrounds. This is particularly important for improving health outcomes in groups that are underrepresented in genetic databases.

Speculation on Future Developments

Looking ahead, the methods evaluated in this paper could be applied beyond the UK Biobank to more genetically and ethnically diverse datasets. Further research could test these models in larger cohorts with broader ancestral diversity. The models could also be combined with newer techniques such as deep learning to see whether predictive accuracy improves further.

Conclusions

In conclusion, the paper advances disease prediction across diverse ancestries in genetic epidemiology by evaluating pretraining and interaction models. The reported improvements are modest and limited to a subset of diseases, but they are statistically significant and point toward prediction models that account for genetic diversity more fully. Given the limited benefits observed, there remains considerable scope for testing these techniques on other datasets and diseases.
