
Counting overlapping pairs of words (2405.09393v3)

Published 15 May 2024 in cs.DM and cs.DS

Abstract: A correlation is a binary vector that encodes all possible positions of overlaps of two words, where an overlap for an ordered pair of words (u,v) occurs if a suffix of word u matches a prefix of word v. As multiple pairs can have the same correlation, it is relevant to count how many pairs of words share the same correlation depending on the alphabet size and word length n. We exhibit recurrences to compute the number of such pairs -- which is termed population size -- for any correlation; for this, we exploit a relationship between overlaps of two words and self-overlap of one word. This theorem allows us to compute the number of pairs with a longest overlap of a given length and to show that the expected length of the longest border of two words asymptotically converges, which solves two open questions raised by Gabric in 2022. Finally, we also provide bounds for the asymptotic of the population ratio of any correlation. Given the importance of word overlaps in areas like word combinatorics, bioinformatics, and digital communication, our results may ease analyses of algorithms for string processing, code design, or genome assembly.
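As a rough illustration of the definitions in the abstract, the following Python sketch computes the correlation of an ordered pair of equal-length words and tallies population sizes by brute-force enumeration. It is not the recurrence-based method of the paper. The function names `correlation` and `population_sizes` are hypothetical, and the indexing convention (bit i is 1 when the suffix of u starting at position i equals the prefix of v of length n - i) is an assumption about how overlap positions are encoded.

```python
from collections import Counter
from itertools import product


def correlation(u, v):
    """Correlation of the ordered pair (u, v), assuming len(u) == len(v) == n.

    Assumed convention: bit i is 1 when the suffix of u starting at
    position i equals the prefix of v of length n - i.
    """
    n = len(u)
    return tuple(int(u[i:] == v[:n - i]) for i in range(n))


def population_sizes(alphabet, n):
    """Count, by exhaustive enumeration, how many ordered pairs of words of
    length n over `alphabet` share each correlation.

    This takes time exponential in n and is only meant for tiny inputs.
    """
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    return Counter(correlation(u, v) for u in words for v in words)


if __name__ == "__main__":
    print(correlation("aab", "aba"))
    for corr, size in sorted(population_sizes("ab", 2).items()):
        print(corr, size)
```

Under this convention, the pair ("aab", "aba") has a single overlap, of length 2 ("ab"), which sets the bit at position 1. Enumeration is feasible only for very small alphabets and word lengths, which is the motivation for recurrences.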

References (28)
  1. A simple suboptimal construction of cross-bifix-free codes. Cryptography and Communications, 6(6):27–37, 2014.
  2. A new approach to cross-bifix-free sets. IEEE Transactions on Information Theory, 58(6):4058–4063, 2012.
  3. Constructions and bounds for codes with restricted overlaps. IEEE Transactions on Information Theory, 70(4):2479–2490, 2024.
  4. On a conjecture by Eriksson concerning overlap in strings. Comb. Probab. Comput., 8(5):429–440, 1999.
  5. Daniel Gabric. Mutual borders and overlaps. IEEE Transactions on Information Theory, 68(10):6888–6893, 2022.
  6. Periods in strings. Journal of Combinatorial Theory, Series A, 30:19–42, 1981.
  7. String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory, Series A, 30(2):183–208, 1981.
  8. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
  9. An efficient algorithm for the All Pairs Suffix-Prefix Problem. Inf Proc Letters, 41(4):181–185, 1992.
  10. Periods and binary words. Journal of Combinatorial Theory, Series A, 89(2):298–303, 2000.
  11. Fast pattern matching in strings. SIAM Journal of Computing, 6:323–350, 1977.
  12. A fast algorithm for the All-Pairs Suffix–Prefix problem. Theoretical Computer Science, 698:14–24, 2017.
  13. M. Lothaire, editor. Algebraic combinatorics on Words. Cambridge University Press, second edition, 1997.
  14. M. Lothaire, editor. Combinatorics on Words. Cambridge University Press, second edition, 1997.
  15. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015.
  16. Peter Tolstrup Nielsen. A note on bifix-free sequences (corresp.). IEEE Trans. Inf. Theory, 19(5):704–706, 1973.
  17. Peter Tolstrup Nielsen. On the expected duration of a search for a fixed pattern in random data (corresp.). IEEE Transactions on Information Theory, 19(5):702–704, 1973.
  18. Theory and Application of Marsaglia’s Monkey Test for Pseudorandom Number Generators. ACM Transactions on Modeling and Computer Simulation, 5(2):87–100, April 1995.
  19. Exact and efficient computation of the expected number of missing and common words in random texts. In Raffaele Giancarlo and David Sankoff, editors, Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21-23, 2000, Proceedings, volume 1848 of Lecture Notes in Computer Science, pages 375–387. Springer, 2000.
  20. On the distribution of the number of missing words in random texts. Combinatorics, Probability and Computing, 12(01), Jan 2003.
  21. Eric Rivals. Incremental algorithms for computing the set of period sets. HAL, 2024. lirmm-04531880, 22 pages.
  22. Combinatorics of Periods in Strings. In F. Orejas, P. Spirakis, and J. van Leeuwen, editors, ICALP 2001, Proc. of the 28th International Colloquium on Automata, Languages and Programming, (ICALP), Crete, Greece, July 8-12, 2001, volume 2076 of Lecture Notes in Computer Science, pages 615–626. Springer Verlag, 2001.
  23. Combinatorics of periods in strings. Journal of Combinatorial Theory, Series A, 104(1):95–113, 2003.
  24. Convergence of the Number of Period Sets in Strings. In Kousha Etessami, Uriel Feige, and Gabriele Puppis, editors, 50th International Colloquium on Automata, Languages, and Programming (ICALP 2023), volume 261 of Leibniz International Proceedings in Informatics (LIPIcs), pages 100:1–100:14, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
  25. DNA, Words and Models. Cambridge University Press, 2005.
  26. William F. Smyth. Computing Patterns in Strings. Pearson - Addison Wesley, 2003.
  27. An improved algorithm for the All-Pairs Suffix–Prefix problem. J. of Discrete Algorithms, 37:34–43, 2016.
  28. Approximate All-Pairs Suffix/Prefix overlaps. Inf. Comput., 213:49–58, 2012.
