TEPI: Taxonomy-aware Embedding and Pseudo-Imaging for Scarcely-labeled Zero-shot Genome Classification (2401.13219v1)

Published 24 Jan 2024 in q-bio.GN, cs.AI, and cs.LG

Abstract: A species' genetic code or genome encodes valuable evolutionary, biological, and phylogenetic information that aids in species recognition, taxonomic classification, and understanding genetic predispositions like drug resistance and virulence. However, the vast number of potential species poses significant challenges in developing a general-purpose whole genome classification tool. Traditional bioinformatics tools have made notable progress but lack scalability and are computationally expensive. Machine learning-based frameworks show promise but must address the issue of large classification vocabularies with long-tail distributions. In this study, we propose addressing this problem through zero-shot learning using TEPI, Taxonomy-aware Embedding and Pseudo-Imaging. We represent each genome as pseudo-images and map them to a taxonomy-aware embedding space for reasoning and classification. This embedding space captures compositional and phylogenetic relationships of species, enabling predictions in extensive search spaces. We evaluate TEPI using two rigorous zero-shot settings and demonstrate its generalization capabilities qualitatively on curated, large-scale, publicly sourced data.
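As a rough illustration of the two ideas named in the abstract, the sketch below builds a toy "pseudo-image" from a DNA string and then picks the closest class in an embedding space. It is not the paper's implementation: the k-mer transition-count image, the function names `kmer_pseudo_image` and `zero_shot_predict`, and the hand-written class vectors are all assumptions made for the example. In TEPI the embeddings are described as taxonomy-aware and the mapping from image to embedding would be learned; here both are replaced by placeholders.

```python
from itertools import product
import math


def kmer_pseudo_image(seq, k=2):
    """Toy pseudo-image (assumed construction, not taken from the paper).

    Cell (i, j) counts how often k-mer i is followed, one base later,
    by k-mer j. Counts are scaled by the largest cell so values lie in [0, 1].
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    size = len(kmers)
    image = [[0.0] * size for _ in range(size)]
    for pos in range(len(seq) - k):
        first = seq[pos:pos + k]
        second = seq[pos + 1:pos + 1 + k]
        # Skip windows containing ambiguous bases such as N.
        if first in index and second in index:
            image[index[first]][index[second]] += 1.0
    peak = max(max(row) for row in image)
    if peak > 0:
        image = [[value / peak for value in row] for row in image]
    return image


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_predict(embedding, class_embeddings):
    """Return the class whose embedding is most similar to the query.

    class_embeddings is a hypothetical dict {class_name: vector}. Classes
    never seen in training can be included as long as they have a vector.
    """
    return max(class_embeddings,
               key=lambda name: cosine(embedding, class_embeddings[name]))


if __name__ == "__main__":
    image = kmer_pseudo_image("ACGTACGTTGCA", k=2)
    print(len(image), "x", len(image[0]))  # 16 x 16 grid for k=2

    # Placeholder 2-D vectors standing in for learned embeddings.
    classes = {"species_a": [0.9, 0.1], "species_b": [0.1, 0.9]}
    print(zero_shot_predict([0.8, 0.3], classes))
```

Nearest-neighbor lookup in a shared embedding space is a common way to handle unseen classes, because adding a species only requires adding its vector rather than retraining an output layer.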
