VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling (2405.10812v2)
Abstract: Similar to natural language models, pre-trained genome language models have been proposed to capture the underlying intricacies within genomes through unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as a learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where codebooks of varying scales are arranged in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of the learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
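As a rough illustration of how a vector-quantized codebook can act as a vocabulary, the sketch below maps a continuous embedding to the index of its nearest codebook entry, then repeats the lookup on the leftover residual with a second, finer codebook. This is generic residual quantization under stated assumptions, not VQDNA's or HRQ's actual implementation: the codebook sizes, the embedding dimension, the random initialization, and the helper names (`nearest_code`, `residual_quantize`, `make_codebook`) are all hypothetical, and training, which would update the codebooks, is omitted.

```python
import random


def nearest_code(vec, codebook):
    """Return (index, entry) of the codebook entry closest to vec (squared L2)."""
    best_idx = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )
    return best_idx, codebook[best_idx]


def residual_quantize(vec, codebooks):
    """Quantize vec with a sequence of codebooks.

    Each codebook encodes the residual left over by the previous ones,
    so later codebooks refine the approximation (coarse to fine).
    """
    residual = list(vec)
    approx = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx, code = nearest_code(residual, codebook)
        indices.append(idx)
        approx = [a + c for a, c in zip(approx, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, approx


def make_codebook(size, dim, scale, rng):
    """Hypothetical random codebook; a real one would be learned."""
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(size)]


rng = random.Random(0)
dim = 4
# Assumed sizes: a small coarse codebook followed by a larger, finer one.
codebooks = [make_codebook(8, dim, 1.0, rng), make_codebook(32, dim, 0.5, rng)]

embedding = [0.3, -0.7, 0.1, 0.9]  # stand-in for an encoder output
indices, approx = residual_quantize(embedding, codebooks)
print("token indices per level:", indices)
print("reconstructed embedding:", approx)
```

The list `indices` plays the role of the discrete tokens (one per codebook level), and `approx` is the sum of the selected entries. In a learned setting the codebook entries would be trained jointly with the encoder instead of being sampled at random.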