A Myhill-Nerode Theorem for Generalized Automata, with Applications to Pattern Matching and Compression (2302.06506v2)
Abstract: The model of generalized automata, introduced by Eilenberg in 1974, allows representing a regular language more concisely than conventional automata by allowing edges to be labeled not only with characters, but also with strings. Giammaresi and Montalbano introduced a notion of determinism for generalized automata [STACS 1995]. While generalized deterministic automata retain many properties of conventional deterministic automata, the uniqueness of a minimal generalized deterministic automaton is lost. In the first part of the paper, we show that the lack of uniqueness can be explained by introducing a set $ \mathcal{W}(\mathcal{A}) $ associated with a generalized automaton $ \mathcal{A} $. By fixing $ \mathcal{W}(\mathcal{A}) $, we are able to derive for the first time a full Myhill-Nerode theorem for generalized automata, which contains the textbook Myhill-Nerode theorem for conventional automata as a degenerate case. In the second part of the paper, we show that the set $ \mathcal{W}(\mathcal{A}) $ leads to applications for pattern matching and data compression. Wheeler automata [TCS 2017, SODA 2020] are a popular class of automata that can be compactly stored using $ e \log \sigma (1 + o(1)) + O(e) $ bits ($ e $ being the number of edges, $ \sigma $ being the size of the alphabet) in such a way that pattern matching queries can be solved in $ \tilde{O}(m) $ time ($ m $ being the length of the pattern). In the paper, we show how to extend these results to generalized automata. More precisely, a Wheeler generalized automaton can be stored using $ \mathfrak{e} \log \sigma (1 + o(1)) + O(e + rn) $ bits so that pattern matching queries can be solved in $ \tilde{O}(r m) $ time, where $ \mathfrak{e} $ is the total length of all edge labels, $ r $ is the maximum length of an edge label and $ n $ is the number of states.
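To make the model concrete, here is a minimal Python sketch of a generalized automaton, that is, an automaton whose edges carry nonempty strings rather than single characters. The class name `GeneralizedAutomaton`, its methods, and the example language are illustrative assumptions and do not come from the paper. The sketch covers only word acceptance and the parameters named in the abstract ($ e $, $ \mathfrak{e} $, $ r $, $ n $); it does not model determinism, the set $ \mathcal{W}(\mathcal{A}) $, or the Wheeler index.

```python
from collections import defaultdict


class GeneralizedAutomaton:
    """Automaton whose edges are labeled with nonempty strings (hypothetical helper)."""

    def __init__(self, initial, finals):
        self.initial = initial
        self.finals = set(finals)
        self.edges = defaultdict(list)  # source state -> list of (label, target)

    def add_edge(self, source, label, target):
        if not label:
            raise ValueError("edge labels must be nonempty strings")
        self.edges[source].append((label, target))

    def accepts(self, word):
        # reachable[i] holds the states reachable after reading word[:i].
        reachable = [set() for _ in range(len(word) + 1)]
        reachable[0].add(self.initial)
        for i in range(len(word) + 1):
            for state in reachable[i]:
                for label, target in self.edges.get(state, ()):
                    # Labels are nonempty, so this only fills later positions.
                    if word.startswith(label, i):
                        reachable[i + len(label)].add(target)
        return bool(reachable[len(word)] & self.finals)

    def parameters(self):
        """Return (e, total label length, r, n) in the abstract's notation."""
        labels = [label for out in self.edges.values() for label, _ in out]
        states = {self.initial} | self.finals | set(self.edges)
        states |= {t for out in self.edges.values() for _, t in out}
        return (len(labels), sum(map(len, labels)),
                max(map(len, labels), default=0), len(states))


# Example (an assumption for illustration): the language (ab)*c with two states.
ga = GeneralizedAutomaton(initial=0, finals={1})
ga.add_edge(0, "ab", 0)
ga.add_edge(0, "c", 1)
print(ga.accepts("ababc"))  # True
print(ga.accepts("abac"))   # False
print(ga.parameters())      # (2, 3, 2, 2)
```

A conventional automaton for the same language needs an extra state to split the label `ab` into two single-character edges, which is the kind of conciseness the abstract refers to.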
- Tatsuya Akutsu. A linear time pattern matching algorithm between a string and a tree. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, Combinatorial Pattern Matching, pages 1–10, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg.
- Linear-time minimization of Wheeler DFAs. In 2022 Data Compression Conference (DCC), pages 53–62, 2022. doi:10.1109/DCC52660.2022.00013.
- Regular Languages meet Prefix Sorting, pages 911–930. URL: https://epubs.siam.org/doi/abs/10.1137/1.9781611975994.55, arXiv:https://epubs.siam.org/doi/pdf/10.1137/1.9781611975994.55, doi:10.1137/1.9781611975994.55.
- Wheeler languages. Information and Computation, 281:104820, 2021. URL: https://www.sciencedirect.com/science/article/pii/S0890540121001504, doi:10.1016/j.ic.2021.104820.
- Pattern matching in hypertext. Journal of Algorithms, 35(1):82–99, 2000. URL: https://www.sciencedirect.com/science/article/pii/S0196677499910635, doi:10.1006/jagm.1999.1063.
- Computational graph pangenomics: a tutorial on data structures and their applications. Nat. Comput., 21(1):81–108, 2022. doi:10.1007/s11047-022-09882-6.
- SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455–477, 2012. PMID: 22506599. doi:10.1089/cmb.2012.0021.
- Sorting Finite Automata via Partition Refinement. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of Leibniz International Proceedings in Informatics (LIPIcs), pages 15:1–15:15, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2023/18668, doi:10.4230/LIPIcs.ESA.2023.15.
- Succinct de Bruijn graphs. In Ben Raphael and Jijun Tang, editors, Algorithms in Bioinformatics, pages 225–235, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
- M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994.
- Computing matching statistics on Wheeler DFAs. In 2023 Data Compression Conference (DCC), pages 150–159, 2023. doi:10.1109/DCC55655.2023.00023.
- Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in $O(|E|^{2}+|V|^{5/2})$ time. In 2022 Data Compression Conference (DCC), pages 272–281, 2022. doi:10.1109/DCC52660.2022.00035.
- Nicola Cotumaccio. Prefix Sorting DFAs: A Recursive Algorithm. In Satoru Iwata and Naonori Kakimura, editors, 34th International Symposium on Algorithms and Computation (ISAAC 2023), volume 283 of Leibniz International Proceedings in Informatics (LIPIcs), pages 22:1–22:15, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2023.22, doi:10.4230/LIPIcs.ISAAC.2023.22.
- Co-lexicographically ordering automata and regular languages - Part I. J. ACM, 70(4), aug 2023. doi:10.1145/3607471.
- Space-time trade-offs for the LCP array of Wheeler DFAs. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval, pages 143–156, Cham, 2023. Springer Nature Switzerland.
- On indexing and compressing finite automata. In Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’21, page 2585–2599, USA, 2021. Society for Industrial and Applied Mathematics.
- Samuel Eilenberg. Automata, Languages, and Machines. Academic Press, Inc., USA, 1974.
- On the complexity of string matching for graphs. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 55:1–55:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICALP.2019.55.
- Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. In Tomáš Bureš, Riccardo Dondi, Johann Gamper, Giovanna Guerrini, Tomasz Jurdziński, Claus Pahl, Florian Sikora, and Prudence W.H. Wong, editors, SOFSEM 2021: Theory and Practice of Computer Science, pages 608–622, Cham, 2021. Springer International Publishing.
- On the complexity of string matching for graphs. ACM Trans. Algorithms, 19(3), apr 2023. doi:10.1145/3588334.
- Algorithms and complexity on indexing elastic founder graphs. In Hee-Kap Ahn and Kunihiko Sadakane, editors, 32nd International Symposium on Algorithms and Computation, ISAAC 2021, December 6-8, 2021, Fukuoka, Japan, volume 212 of LIPIcs, pages 20:1–20:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ISAAC.2021.20.
- Algorithms and complexity on indexing founder graphs. Algorithmica, 85(6):1586–1623, 2023. doi:10.1007/s00453-022-01007-w.
- P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS’00), pages 390–398, 2000. doi:10.1109/SFCS.2000.892127.
- Indexing compressed text. J. ACM, 52(4):552–581, jul 2005. doi:10.1145/1082036.1082039.
- Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67–78, 2017. Algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo). URL: https://www.sciencedirect.com/science/article/pii/S0304397517305285, doi:10.1016/j.tcs.2017.06.016.
- Deterministic generalized automata. In Ernst W. Mayr and Claude Puech, editors, STACS 95, 12th Annual Symposium on Theoretical Aspects of Computer Science, Munich, Germany, March 2-4, 1995, Proceedings, volume 900 of Lecture Notes in Computer Science, pages 325–336. Springer, 1995. doi:10.1007/3-540-59042-0_84.
- Deterministic generalized automata. Theor. Comput. Sci., 215(1-2):191–208, 1999. doi:10.1016/S0304-3975(97)00166-7.
- On the complexity of recognizing Wheeler graphs. Algorithmica, 84(3):784–814, mar 2022. doi:10.1007/s00453-021-00917-5.
- Kosaburo Hashiguchi. Algorithms for determining the smallest number of nonterminals (states) sufficient for generating (accepting) a regular language. In Javier Leach Albert, Burkhard Monien, and Mario Rodríguez Artalejo, editors, Automata, Languages and Programming, pages 641–648, Berlin, Heidelberg, 1991. Springer Berlin Heidelberg.
- John Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Zvi Kohavi and Azaria Paz, editors, Theory of Machines and Computations, pages 189–196. Academic Press, 1971. URL: https://www.sciencedirect.com/science/article/pii/B9780124177505500221, doi:10.1016/B978-0-12-417750-5.50022-1.
- Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA, 2006.
- A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2):291–306, 1995.
- Faster prefix-sorting algorithms for deterministic finite automata. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 16:1–16:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.CPM.2023.16.
- Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977. doi:10.1137/0206024.
- Approximate string matching with arbitrary costs for text and hypertext. In Advances In Structural And Syntactic Pattern Recognition, pages 22–33. World Scientific, 1992.
- Genome-Scale Algorithm Design: Bioinformatics in the Era of High-Throughput Sequencing. Cambridge University Press, 2 edition, 2023.
- Gonzalo Navarro. Improved approximate pattern matching on hypertext. Theor. Comput. Sci., 237(1–2):455–463, apr 2000. doi:10.1016/S0304-3975(99)00333-3.
- Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 2016. doi:10.1017/CBO9781316588284.
- String matching in hypertext. In Zvi Galil and Esko Ukkonen, editors, Combinatorial Pattern Matching, pages 318–329, Berlin, Heidelberg, 1995. Springer Berlin Heidelberg.
- An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753, 2001. URL: https://www.pnas.org/doi/abs/10.1073/pnas.171285098, arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.171285098, doi:10.1073/pnas.171285098.
- Aligning sequences to general graphs in O(V + mE) time. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/11/08/216127, arXiv:https://www.biorxiv.org/content/early/2017/11/08/216127.full.pdf, doi:10.1101/216127.
- Linear time construction of indexable elastic founder graphs. In Cristina Bazgan and Henning Fernau, editors, Combinatorial Algorithms, pages 480–493, Cham, 2022. Springer International Publishing.
- Efficient construction of an assembly string graph using the FM-index. Bioinform., 26(12):367–373, 2010. doi:10.1093/bioinformatics/btq217.