
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling (2405.10812v2)

Published 13 May 2024 in q-bio.GN and cs.AI

Abstract: Similar to natural language models, pre-trained genome language models are proposed to capture the underlying intricacies within genomes with unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where varying scales of codebooks are designed in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
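The codebook lookup behind vector quantization, and its residual, coarse-to-fine extension, can be illustrated with a small sketch. This is not the VQDNA or HRQ implementation: the function names (`nearest_code`, `residual_quantize`), the two toy codebooks, and the choice of squared Euclidean distance are all assumptions made for illustration. It covers only the lookup step and says nothing about how the codebooks are learned.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x level by level.

    Each level picks the code nearest to the residual left over by the
    previous levels, so later codebooks refine the earlier approximation.
    Returns the chosen index per level and the summed approximation.
    """
    residual = list(x)
    approx = [0.0] * len(x)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        approx = [a + c for a, c in zip(approx, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, approx


# Hypothetical hand-picked codebooks: a coarse level with 2 entries
# followed by a finer level with 4 entries.
coarse = [(0.0, 0.0), (1.0, 1.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)]

indices, approx = residual_quantize((0.8, 0.7), [coarse, fine])
print(indices, approx)
```

In this toy setting the pair of indices plays the role of a token for the input embedding: the first index gives a coarse match and the second corrects it. In a learned model the codebooks would be trained jointly with the encoder rather than fixed by hand.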
