
Deep Learning for Efficient GWAS Feature Selection (2312.15055v1)

Published 22 Dec 2023 in q-bio.GN, cs.LG, and stat.ME

Abstract: Genome-Wide Association Studies (GWAS) face unique challenges in the era of big genomics data, particularly with ultra-high-dimensional datasets in which the number of genetic features far exceeds the number of available samples. This paper extends the feature selection methodology of Mirzaei et al. (2020) to these ultra-high-dimensional GWAS settings by adding a Frobenius norm penalty to the student network, improving its robustness when features vastly outnumber samples. The method operates in both supervised and unsupervised settings and employs two neural networks: the first, an autoencoder or supervised autoencoder, performs dimension reduction and extracts salient features from the genomic data; the second, a regularized feed-forward network with a single hidden layer, performs the feature selection itself. Experimental results demonstrate the efficacy of the approach: it handles the complexities of ultra-high-dimensional settings, adapts to the structure of genomics data, and performs well across a range of experiments.
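To make the two-network pipeline described in the abstract concrete, here is a minimal PyTorch sketch. This is not the authors' implementation: the layer widths, the two-stage training schedule, the penalty weight `lam_frob`, and the choice to rank features by the column norms of the student's first-layer weights are illustrative assumptions based on the abstract and on the teacher-student framework of Mirzaei et al. (2020).

```python
import torch
import torch.nn as nn

class TeacherAE(nn.Module):
    """Teacher: an autoencoder compressing p genomic features to a k-dim code.
    (The paper's supervised variant would add a label-prediction head on the code.)"""
    def __init__(self, p, k, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(p, hidden), nn.ReLU(), nn.Linear(hidden, k))
        self.decoder = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, p))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

class Student(nn.Module):
    """Student: a single-hidden-layer network trained to reproduce the teacher's code.
    The column norms of its first-layer weights act as feature-importance scores."""
    def __init__(self, p, k, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(p, hidden)
        self.fc2 = nn.Linear(hidden, k)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

def select_features(X, k=16, lam_frob=1e-3, n_keep=100, epochs=200, lr=1e-3):
    """X: (n, p) float tensor of standardized genotypes, with p >> n.
    Returns indices of the n_keep highest-scoring features."""
    n, p = X.shape
    teacher, student = TeacherAE(p, k), Student(p, k)

    # Stage 1: fit the teacher by reconstruction loss.
    opt = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = teacher(X)
        nn.functional.mse_loss(recon, X).backward()
        opt.step()

    # Stage 2: fit the student to mimic the teacher's low-dimensional code.
    # The squared Frobenius norm on the first-layer weights is the penalty the
    # abstract adds for the p >> n regime (lam_frob is an illustrative value).
    with torch.no_grad():
        _, code = teacher(X)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = (nn.functional.mse_loss(student(X), code)
                + lam_frob * student.fc1.weight.pow(2).sum())
        loss.backward()
        opt.step()

    # Rank input features by the l2 norm of their outgoing first-layer weights.
    scores = student.fc1.weight.norm(dim=0)
    return torch.topk(scores, n_keep).indices
```

Detaching the teacher's code before stage 2 keeps the teacher fixed, so only the student's penalized first-layer weights determine which features are retained.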

References (44)
  1. Feature selection for high-dimensional and imbalanced biomedical data based on robust correlation-based redundancy and binary grasshopper optimization algorithm. Genes, 11(7):717.
  2. Concrete autoencoders for differentiable feature selection and reconstruction. arXiv preprint arXiv:1901.09346.
  3. Deep learning for computational biology. Molecular Systems Biology, 12(7):878.
  4. Resampling-based tests for lasso in genome-wide association studies. BMC Genetics, 18:1–15.
  5. Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11(3):375–386.
  6. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genetic Epidemiology, 34(8):879–891.
  7. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  8. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2).
  9. Nonlinear variable selection via deep neural networks. Journal of Computational and Graphical Statistics, 30(2):484–492.
  10. Feature selection for generalized varying coefficient mixed-effect models with application to obesity GWAS. The Annals of Applied Statistics, 14(1):276.
  11. Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics, 10(4):417–451.
  12. A comparative study on feature selection for a risk prediction model for colorectal cancer. Computer Methods and Programs in Biomedicine, 177:219–229.
  13. Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110(510):630–641.
  14. SNPs selection using support vector regression and genetic algorithms in GWAS. BMC Genomics, 15:1–15.
  15. Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. The American Journal of Human Genetics, 75(1):35–43.
  16. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(5):849–911.
  17. Graph autoencoder-based unsupervised feature selection with broad and local data structure preservation. Neurocomputing, 312:310–323.
  18. AFS: An attention-based mechanism for supervised feature selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3705–3713.
  19. Autoencoder inspired unsupervised feature selection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2941–2945. IEEE.
  20. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  21. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
  22. Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2(10):573–584.
  23. LassoNet: A neural network with feature sparsity. The Journal of Machine Learning Research, 22(1):5633–5661.
  24. GWAsimulator: a rapid whole-genome simulation program. Bioinformatics, 24(1):140–142.
  25. Predicting and analyzing early wake-up associated gene expressions by integrating GWAS and eQTL studies. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 1864(6):2241–2246.
  26. Li, K. (2022). Variable selection for nonlinear Cox regression model via deep learning. arXiv preprint arXiv:2211.09287.
  27. Calibrating multi-dimensional complex ODE from noisy data via deep neural networks. arXiv preprint arXiv:2106.03591.
  28. Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. Neurocomputing, 538:126186.
  29. Semiparametric regression for spatial data via deep learning. arXiv preprint arXiv:2301.03747.
  30. Deep feature selection: theory and application to identify enhancers and promoters. Journal of Computational Biology, 23(5):322–336.
  31. Deep neural networks for high dimension, low sample size data. In IJCAI, pages 2287–2293.
  32. DeepPINK: reproducible feature selection in deep neural networks. Advances in Neural Information Processing Systems, 31.
  33. Deep feature selection using a teacher-student network. Neurocomputing, 383:396–408.
  34. A review of feature selection methods for machine learning-based disease risk prediction. Frontiers in Bioinformatics, 2:927312.
  35. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89.
  36. FsNet: Feature selection network on high-dimensional biological data. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE.
  37. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25.
  38. A review of unsupervised feature selection methods. Artificial Intelligence Review, 53(2):907–948.
  39. Feature selection methods and genomic big data: a systematic review. Journal of Big Data, 6(1):1–24.
  40. Novel unsupervised feature filtering of biological data. Bioinformatics, 22(14):e507–e513.
  41. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721.
  42. Prioritizing genetic variants in GWAS with lasso using permutation-assisted tuning. Bioinformatics, 36(12):3811–3817.
  43. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38.
  44. Heterogeneous feature selection with multi-modal deep neural networks and sparse group lasso. IEEE Transactions on Multimedia, 17(11):1936–1948.
