Embed-Search-Align: DNA Sequence Alignment using Transformer Models (2309.11087v6)
Abstract: DNA sequence alignment involves assigning short DNA reads to the most probable locations on an extensive reference genome. This process is crucial for various genomic analyses, including variant calling, transcriptomics, and epigenomics. Conventional methods, refined over decades, tackle this challenge in two steps: genome indexing followed by an efficient search to locate likely positions for a given read. Building on the success of large language models (LLMs) in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have explored whether the same Transformer architecture can produce embeddings for DNA sequences. Such models have shown early promise at classifying short DNA sequences, for example detecting coding vs. non-coding regions or enhancer and promoter sequences. However, performance on sequence classification does not translate to sequence alignment, which requires searching across the entire genome to place each read, a significantly longer-range task. We bridge this gap by framing sequence alignment for Transformer models as an "Embed-Search-Align" (ESA) task. In this framework, a novel reference-free DNA embedding model generates embeddings of reads and reference fragments, which are projected into a shared vector space where the read-fragment distance serves as a surrogate for alignment. Technical contributions include (1) a contrastive loss for self-supervised training of DNA sequence representations, yielding rich, reference-free, sequence-level embeddings, and (2) a DNA vector store that enables search across fragments at a global scale. DNA-ESA is 99% accurate when aligning 250-nucleotide reads onto the human genome (~3 Gb), rivaling conventional methods such as Bowtie and BWA-MEM. DNA-ESA exceeds the performance of six Transformer baselines, including Nucleotide Transformer and HyenaDNA, and shows task transfer across chromosomes and species.
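As a rough illustration of the Embed-Search-Align loop described in the abstract, the sketch below pairs an in-batch InfoNCE-style contrastive objective (one common instantiation of the contrastive loss named above; the paper's exact formulation may differ) with a FAISS inner-product index standing in for the DNA vector store. The encoder, embedding width, temperature, and function names here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of Embed-Search-Align: contrastive training signal +
# vector-store search. EMBED_DIM, info_nce, and the index choice are
# assumptions for illustration only.
import numpy as np
import torch
import torch.nn.functional as F
import faiss  # billion-scale similarity search (cited in the references)

EMBED_DIM = 768  # assumed embedding width

def info_nce(read_emb: torch.Tensor, frag_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: row i of read_emb should score highest
    against row i of frag_emb (the reference fragment the read was drawn
    from); all other rows in the batch act as negatives."""
    read_emb = F.normalize(read_emb, dim=-1)
    frag_emb = F.normalize(frag_emb, dim=-1)
    logits = read_emb @ frag_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def build_fragment_index(frag_embs: np.ndarray) -> faiss.IndexFlatIP:
    """DNA vector store: exact inner-product search over L2-normalized
    fragment embeddings, i.e. cosine similarity."""
    x = np.ascontiguousarray(frag_embs, dtype=np.float32)
    faiss.normalize_L2(x)                # in-place normalization
    index = faiss.IndexFlatIP(EMBED_DIM)
    index.add(x)
    return index

def search_reads(index: faiss.IndexFlatIP, read_embs: np.ndarray, k: int = 5):
    """Search step: the k nearest fragments are candidate alignment loci;
    fine-grained alignment within each fragment would follow."""
    q = np.ascontiguousarray(read_embs, dtype=np.float32)
    faiss.normalize_L2(q)
    scores, frag_ids = index.search(q, k)
    return scores, frag_ids
```

In this sketch, read-fragment distance in the shared embedding space is what stands in for alignment; an approximate index (e.g., a FAISS IVF variant) would replace the exact flat index at genome scale.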
- wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- On the opportunities and risks of foundation models, 2022.
- Language models are few-shot learners, 2020.
- A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
- Nicos Christofides. Worst-case analysis of a new heuristic for the travelling salesman problem. Operations Research Forum, 3(1):20, 2022. First circulated as Report 388, Graduate School of Industrial Administration, Carnegie Mellon University, 1976.
- The Use of Confidence or Fiducial Limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 12 1934.
- Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, 03 2009.
- Promptagator: Few-shot dense retrieval from 8 examples. arXiv preprint arXiv:2209.11755, 2022.
- The nucleotide transformer: Building and evaluating robust foundation models for human genomics. bioRxiv, 2023.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences. bioRxiv, 2023.
- SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
- Simon G Gregory. Contig Assembly. John Wiley & Sons, Ltd, 2005.
- Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
- ART: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2012.
- David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
- Origin of human chromosome 2: an ancestral telomere-telomere fusion. Proceedings of the National Academy of Sciences, 88(20):9051–9055, 1991.
- DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
- Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
- Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 35(3):421–432, 07 2018.
- Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 05 2018.
- Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
- Mappability and read length. Frontiers in Genetics, 5, 2014.
- Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph. Briefings in Functional Genomics, 11(1):25–37, 12 2011.
- A draft human pangenome reference. Nature, 617(7960):312–324, 2023.
- Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
- UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- Versatile genome assembly evaluation with QUAST-LG. Bioinformatics, 34(13):i142–i150, 06 2018.
- MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics.
- HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. arXiv preprint arXiv:2306.15794, 2023.
- The complete sequence of a human genome. Science, 376(6588):44–53, 2022.
- Check your facts and try again: Improving large language models with external knowledge and automated feedback, 2023.
- Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897, 2020.
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics.
- Re-inventing Willis. Physics Reports, 502(1):1–35, 2011.
- Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, volume 11006, pages 369–386. SPIE, 2019.
- BERT rediscovers the classical NLP pipeline, 2019.
- Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 314–324, 2019.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions. arXiv preprint arXiv:2105.00335, 2021.
- BrainBERT: Self-supervised representation learning for intracranial recordings. arXiv preprint arXiv:2302.14367, 2023.
- DNABERT-2: Efficient foundation model and benchmark for multi-species genome, 2023.
- Improving diversity in ranking using absorbing random walks. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 97–104, 2007.
- Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data, 3(1):1–26, 2016.
- GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. bioRxiv, 2022.