
Levenshtein Distance Embedding with Poisson Regression for DNA Storage (2312.07931v1)

Published 13 Dec 2023 in cs.LG and q-bio.QM

Abstract: Efficient computation or approximation of Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
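
To make the idea in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It computes the exact Levenshtein distance with the standard dynamic program, then scores a pair of embedding vectors with a Poisson negative log-likelihood in which the observed count is the Levenshtein distance. Using the squared Euclidean distance between embeddings as the Poisson rate is an assumption made here for illustration, and the function names and the hand-picked embedding vectors are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def squared_euclidean(u, v) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(rate: float, count: int) -> float:
    """Negative log-likelihood of `count` under Poisson(rate)."""
    rate = max(rate, 1e-8)  # guard against log(0)
    return rate - count * math.log(rate) + math.lgamma(count + 1)


# Hypothetical embeddings of two DNA strings (assumed values, not learned).
seq_a, seq_b = "ACGTAC", "ACTAC"
emb_a, emb_b = [0.5, 1.0, -0.5], [0.0, 0.5, 0.0]

target = levenshtein(seq_a, seq_b)         # observed count
rate = squared_euclidean(emb_a, emb_b)     # assumed Poisson rate
loss = poisson_nll(rate, target)
```

For a fixed target distance, this loss is smallest when the embedding distance equals the target, which is the property a training loop would exploit when fitting the embedding network.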
