Predicting loss-of-function impact of genetic mutations: a machine learning approach (2402.00054v1)
Abstract: Advances in next-generation sequencing (NGS) techniques have significantly reduced the price of genome sequencing, lowering barriers to future medical research; it is now feasible to apply genome sequencing in studies where it would previously have been cost-inefficient. Identifying damaging or pathogenic mutations in vast amounts of complex, high-dimensional genome sequencing data may be of particular interest to researchers. This paper therefore aims to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores, which measure a gene's intolerance to loss-of-function mutations. These attributes include, but are not limited to, the position of a mutation on a chromosome, the changes in amino acids, and the changes in codons caused by the mutation. Models were built using the univariate feature selection technique f-regression combined with K-nearest neighbors (KNN), Support Vector Machine (SVM), Random Sample Consensus (RANSAC), Decision Trees, Random Forest, and Extreme Gradient Boosting (XGBoost). The models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance. Several of the trained models achieved testing-set r-squared values of 0.97.
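The workflow summarized in the abstract (univariate f-regression feature selection, a regressor, and five-fold cross-validated averages of five metrics) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration and not the authors' code or data: the toy dataset from `make_toy_data`, the choice of KNN as the regressor, the helper names, and the settings `n_features=2` and `k=5` are all hypothetical. In practice, scikit-learn's `f_regression`, `SelectKBest`, `KNeighborsRegressor`, and `cross_validate` would likely fill the same roles.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def f_scores(X, y):
    """Univariate F-statistic per feature: F = r^2 / (1 - r^2) * (n - 2)."""
    n = len(y)
    scores = []
    for j in range(len(X[0])):
        r = pearson([row[j] for row in X], y)
        scores.append(r * r / (1.0 - r * r) * (n - 2))
    return scores


def knn_predict(train_X, train_y, x, k=5):
    """Predict the mean target of the k nearest training points (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return sum(train_y[i] for i in order[:k]) / k


def regression_metrics(y_true, y_pred):
    """The five metrics named in the abstract, for one set of predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in residuals) / n
    mean_true = sum(y_true) / n
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    mean_res = sum(residuals) / n
    var_res = sum((e - mean_res) ** 2 for e in residuals) / n
    return {
        "r2": 1.0 - mse / var_true,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in residuals) / n,
        "explained_variance": 1.0 - var_res / var_true,
    }


def cross_validate(X, y, n_splits=5, n_features=2, k=5, seed=0):
    """Average each metric over n_splits shuffled folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    per_fold = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Score features on the training fold only, to avoid leaking test data.
        scores = f_scores(X_tr, y_tr)
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        keep = ranked[:n_features]
        Z_tr = [[row[j] for j in keep] for row in X_tr]
        preds = [knn_predict(Z_tr, y_tr, [X[i][j] for j in keep], k) for i in test_idx]
        per_fold.append(regression_metrics([y[i] for i in test_idx], preds))
    return {name: sum(m[name] for m in per_fold) / n_splits for name in per_fold[0]}


def make_toy_data(n=200, seed=0):
    """Hypothetical data: the target depends on features 0 and 1 only."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        X.append(row)
        y.append(2.0 * row[0] - 1.5 * row[1] + rng.gauss(0.0, 0.05))
    return X, y


X, y = make_toy_data()
averages = cross_validate(X, y)
for name, value in averages.items():
    print(f"{name}: {value:.4f}")
```

Note that r-squared and explained variance coincide only when the residuals have zero mean: a prediction that is off by a constant keeps explained variance at 1 while r-squared drops, which is one reason to report both.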