
Deep Learning for Genomics: A Concise Overview (1802.00810v4)

Published 2 Feb 2018 in q-bio.GN and cs.LG

Abstract: Advancements in genomic research, such as high-throughput sequencing techniques, have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics poses unique challenges for deep learning, since we expect from it a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, and point out potential opportunities and obstacles for future genomics applications.

  226. Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. In Pacific Symposium on Biocomputing Co-Chairs, pages 132–143. World Scientific, 2014.
  227. Adage-based integration of publicly available pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions. mSystems, 1(1):e00025–15, 2016.
  228. Unsupervised extraction of stable expression signatures from public compendia with eadage. bioRxiv, page 078659, 2017.
  229. 3d deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinformatics, 18(1):302, Jun 2017. ISSN 1471-2105. doi: 10.1186/s12859-017-1702-0. URL https://doi.org/10.1186/s12859-017-1702-0.
  230. Training genotype callers with neural networks. bioRxiv, page 097469, 2016.
  231. Implicit causal models for genome-wide association studies. arXiv preprint arXiv:1710.10742, 2017.
  232. Omics Data and Data Representations for Deep Learning-Based Predictive Modeling. International Journal of Molecular Sciences, 23(20):12272, October 2022. ISSN 1422-0067. doi: 10.3390/ijms232012272. URL https://www.mdpi.com/1422-0067/23/20/12272.
  233. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PloS one, 12(2):e0171410, 2017.
  234. Deep learning to analyze rna-seq gene expression data. In International Work-Conference on Artificial Neural Networks, pages 50–59. Springer, 2017.
  235. Proq3: Improved model quality assessments using rosetta energy terms. Scientific reports, 6:33509, 2016.
  236. Proq3d: improved model quality assessments using deep learning. Bioinformatics, 33(10):1578–1580, 2017.
  237. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  238. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
  239. Machine learning for protein subcellular localization prediction. Walter de Gruyter GmbH & Co KG, 2015.
  240. Variable selection in heterogeneous datasets: A truncated-rank sparse linear mixed model with applications to genome-wide association studies. bioRxiv, page 228106, 2017a.
  241. Extracting compact representation of knowledge from gene expression data for protein-protein interaction. International Journal of Data Mining and Bioinformatics, 17(4):279–292, 2017b.
  242. Select-additive learning: Improving generalization in multimodal sentiment analysis. In Multimedia and Expo (ICME), 2017 IEEE International Conference on, pages 949–954. IEEE, 2017c.
  243. On the origin of deep learning. arXiv preprint arXiv:1702.07800, 2017d.
  244. Contact-distil: Boosting low homologous protein contact map prediction by self-supervised distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 4620–4627, 2022.
  245. Protein secondary structure prediction using deep convolutional neural fields. Sci Rep, 6:18962, Jan 2016a. ISSN 2045-2322. doi: 10.1038/srep18962. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4707437/. 26752681[pmid].
  246. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS computational biology, 13(1):e1005324, 2017e.
  247. Predicting dna methylation state of cpg dinucleotide using genome topological features and deep networks. 6:19598 EP –, Jan 2016b. URL http://dx.doi.org/10.1038/srep19598. Article.
  248. GPDBN: deep bilinear network integrating both genomic data and pathological images for breast cancer prognosis prediction. Bioinformatics, 37(18):2963–2970, 03 2021. ISSN 1367-4803. doi: 10.1093/bioinformatics/btab185. URL https://doi.org/10.1093/bioinformatics/btab185.
  249. Secondary structure prediction with support vector machines. Bioinformatics, 19(13):1650–1655, 2003.
  250. Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 5(4):276–287, 2004.
  251. Molecular structure of nucleic acids. Nature, 171(4356):737–738, 1953.
  252. Evaluating deep variational autoencoders trained on pan-cancer gene expression. arXiv preprint arXiv:1711.04828, 2017a.
  253. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. bioRxiv, page 174474, 2017b.
  254. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113–1120, 2013.
  255. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
  256. Ultra-fast protein structure prediction to capture effects of sequence variation in mutation movies. bioRxiv, pages 2022–11, 2022.
  257. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nature genetics, 48(5):488, 2016.
  258. Multitask learning in computational biology. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 207–216, 2012.
  259. High-resolution de novo structure prediction from primary sequence. BioRxiv, pages 2022–07, 2022.
  260. Deepdist: real-value inter-residue distance prediction with deep residual convolutional network. BMC bioinformatics, 22:1–17, 2021.
  261. Fair deep learning prediction for healthcare applications with confounder filtering. arXiv preprint arXiv:1803.07276, 2018.
  262. A deep auto-encoder model for gene expression prediction. BMC genomics, 18(9):845, 2017.
  263. The human splicing code reveals new insights into the genetic determinants of disease. Science, 347(6218):1254806–1254806, Jan 2015. ISSN 0036-8075. doi: 10.1126/science.1254806. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4362528/. 25525159[pmid].
  264. Bayesian prediction of tissue-regulated splicing using rna sequence and cellular context. Bioinformatics, 27(18):2554–2562, 2011.
  265. A survey of transfer and multitask learning in bioinformatics. Journal of Computing Science and Engineering, 5(3):257–268, 2011.
  266. Dual convolutional neural networks with attention mechanisms based method for predicting disease-related lncrna genes. Frontiers in genetics, 10:416, 2019.
  267. Biren: predicting enhancers with a deep-learning-based model using the dna sequence alone. Bioinformatics, 33(13):1930–1936, 2017.
  268. scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence, 4(10):852–866, 2022.
  269. Advantages and pitfalls in the application of mixed-model association methods. Nature genetics, 46(2):100–106, 2014.
  270. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences, 117(3):1496–1503, 2020.
  271. Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(D1):D955–D961, 2013. doi: 10.1093/nar/gks1111. URL +http://dx.doi.org/10.1093/nar/gks1111.
  272. Principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763–774, 2001.
  273. An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In Hybrid Intelligent Systems, 2005. HIS’05. Fifth International Conference on, pages 6–pp. IEEE, 2005.
  274. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature genetics, 38(2):203–208, 2006.
  275. Predicting gene expression from sequence: a reexamination. PLoS computational biology, 3(11):e243, 2007.
  276. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  277. A modified definition of sov, a segment-based measure for protein secondary structure prediction assessment. Proteins: Structure, Function, and Bioinformatics, 34(2):220–223, 1999.
  278. Convolutional neural network architectures for predicting dna–protein binding. Bioinformatics, 32(12):i121–i127, 2016.
  279. Template-based prediction of protein structure with deep learning. BMC genomics, 21(11):1–9, 2020.
  280. Enhancing the protein tertiary structure prediction by multiple sequence alignment generation. arXiv preprint arXiv:2306.01824, 2023.
  281. A deep learning framework for modeling structural features of rna-binding protein targets. Nucleic acids research, 44(4):e32–e32, 2015.
  282. Deep model based transfer and multi-task learning for biological image analysis. IEEE Transactions on Big Data, 2016.
  283. Hicplus: Resolution enhancement of hi-c interaction heatmap. bioRxiv, page 112631, 2017.
  284. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In International Conference on Machine Learning, pages 745–753, 2014.
  285. Predicting effects of noncoding variants with deep learning-based sequence model. Nature methods, 12(10):931–934, 2015.
  286. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nature Genetics, 50(8):1171–1179, August 2018. ISSN 1061-4036. doi: 10.1038/s41588-018-0160-6.
  287. Dnabert-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.
  288. Genslms: Genome-scale language models reveal sars-cov-2 evolutionary dynamics. bioRxiv, 2022.
Authors (8)
  1. Tianwei Yue (7 papers)
  2. Yuanxin Wang (6 papers)
  3. Longxiang Zhang (6 papers)
  4. Chunming Gu (2 papers)
  5. Haoru Xue (11 papers)
  6. Wenping Wang (184 papers)
  7. Qi Lyu (8 papers)
  8. Yujie Dun (5 papers)
Citations (89)
