
DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS Channel and DNA Storage (2312.12717v1)

Published 20 Dec 2023 in cs.IT, cs.LG, and math.IT

Abstract: Recently, DNA storage has emerged as a promising data storage solution, offering significant advantages in storage density, maintenance cost efficiency, and parallel replication capability. Mathematically, the DNA storage pipeline can be viewed as an insertion, deletion, and substitution (IDS) channel. Because of the mathematical terra incognita of the Levenshtein distance, designing an IDS-correcting code is still a challenge. In this paper, we propose an innovative approach that utilizes deep Levenshtein distance embedding to bypass these mathematical challenges. By representing the Levenshtein distance between two sequences as a conventional distance between their corresponding embedding vectors, the inherent structural property of Levenshtein distance is revealed in the friendly embedding space. Leveraging this embedding space, we introduce the DoDo-Code, an IDS-correcting code that incorporates deep embedding of Levenshtein distance, deep embedding-based codeword search, and deep embedding-based segment correcting. To address the requirements of DNA storage, we also present a preliminary algorithm for long sequence decoding. As far as we know, the DoDo-Code is the first IDS-correcting code designed using plausible deep learning methodologies, potentially paving the way for a new direction in error-correcting code research. It is also the first IDS code that exhibits characteristics of being "optimal" in terms of redundancy, significantly outperforming the mainstream IDS-correcting codes of the Varshamov-Tenengolts code family in code rate.
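To make the terms in the abstract concrete, the following is a minimal sketch of what an IDS-correcting code over the DNA alphabet looks like. It is not the DoDo-Code method: it computes the exact Levenshtein distance by dynamic programming and picks codewords by brute-force greedy search, whereas the paper replaces these steps with a learned embedding. The function names `levenshtein`, `greedy_code`, and `decode`, and the choice of length-4 codewords, are assumptions made for illustration only. The sketch relies on one standard fact: if every pair of codewords is at Levenshtein distance at least 3, no received word can be within one edit of two different codewords, so a single insertion, deletion, or substitution can be corrected by nearest-codeword decoding.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance via the classic O(len(a) * len(b)) DP."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def greedy_code(length: int, min_dist: int = 3, alphabet: str = "ACGT") -> list[str]:
    """Greedy brute-force codeword search (illustrative baseline only).

    Scans all words in lexicographic order and keeps a word if it is at
    Levenshtein distance >= min_dist from every codeword kept so far.
    """
    code: list[str] = []
    for cand in map("".join, product(alphabet, repeat=length)):
        if all(levenshtein(cand, c) >= min_dist for c in code):
            code.append(cand)
    return code


def decode(received: str, code: list[str]) -> str:
    """Nearest-codeword decoding under the Levenshtein distance."""
    return min(code, key=lambda c: levenshtein(received, c))


if __name__ == "__main__":
    code = greedy_code(4)
    print(len(code), "codewords, e.g.", code[:3])
    print(decode("AAA", code))   # one deletion applied to "AAAA"
    print(decode("ACAA", code))  # one substitution applied to "AAAA"
```

The exhaustive scan grows exponentially with the codeword length, and each distance evaluation is quadratic, which illustrates why an approach that approximates the Levenshtein distance with a cheap distance in an embedding space is attractive for longer segments.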
