
An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding (2405.09585v3)

Published 15 May 2024 in cs.LG and cs.AI

Abstract: Genomic selection (GS), a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. Current GS approaches predominantly rely on statistical methods for prediction. However, statistical methods have two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers with deep learning. However, because crop datasets commonly consist of long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of the attention mechanism for this task, we propose a simple yet effective Transformer-based framework that enables end-to-end training on the whole sequence. Through experiments on the rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, a Transformer can achieve overall superior performance over seminal methods on the GS tasks of interest.
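
The two tricks named in the abstract, k-mer tokenization and random masking, can be illustrated with a minimal Python sketch. The abstract does not specify the details, so everything here is an assumption made for illustration: non-overlapping k-mers (which shorten a sequence of n markers to about n/k tokens), a vocabulary built in order of first appearance, a reserved mask id of 0, and the hypothetical helper names `kmer_tokenize`, `build_vocab`, and `random_mask`. It is not the authors' implementation.

```python
import random
from typing import Dict, List, Sequence

MASK_ID = 0  # assumption: id 0 is reserved for the mask token


def kmer_tokenize(markers: Sequence[str], k: int) -> List[str]:
    """Group consecutive markers into non-overlapping k-mers.

    A trailing group shorter than k is kept, so no marker is dropped.
    """
    if k < 1:
        raise ValueError("k must be positive")
    return ["".join(markers[i:i + k]) for i in range(0, len(markers), k)]


def build_vocab(tokens: Sequence[str]) -> Dict[str, int]:
    """Assign ids starting at 1 in order of first appearance (0 is the mask)."""
    vocab: Dict[str, int] = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
    return vocab


def random_mask(ids: Sequence[int], mask_prob: float,
                rng: random.Random) -> List[int]:
    """Replace each id with MASK_ID independently with probability mask_prob."""
    return [MASK_ID if rng.random() < mask_prob else i for i in ids]


# Toy marker sequence; real GS inputs are far longer.
markers = list("ACGTTACGGA")
tokens = kmer_tokenize(markers, k=3)   # ['ACG', 'TTA', 'CGG', 'A']
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]       # [1, 2, 3, 4]
masked = random_mask(ids, mask_prob=0.5, rng=random.Random(0))
```

In this sketch the masked ids would be fed to the model during training only, as a form of input noise; whether the paper masks tokens, raw markers, or embeddings is not stated in the abstract.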

