
Finding Diverse Strings and Longest Common Subsequences in a Graph

Published 30 Apr 2024 in cs.DS, cs.CC, and cs.FL | (2405.00131v2)

Abstract: In this paper, we study for the first time the Diverse Longest Common Subsequences (LCSs) problem under Hamming distance. Given a constant number of input strings, the problem asks to decide whether there exists a subset $\mathcal X$ of $K$ longest common subsequences whose diversity is no less than a specified threshold $\Delta$. We consider two types of diversity of a set $\mathcal X$ of strings of equal length: the Sum diversity and the Min diversity, defined as the sum and the minimum, respectively, of the pairwise Hamming distances between strings in $\mathcal X$. We analyze the computational complexity of the corresponding problems, called the Max-Sum and Max-Min Diverse LCSs problems, respectively, considering both approximation algorithms and parameterized complexity. Our results are summarized as follows. When $K$ is bounded, both problems are polynomial-time solvable. In contrast, when $K$ is unbounded, both problems become NP-hard, while the Max-Sum Diverse LCSs problem admits a PTAS. Furthermore, we analyze the parameterized complexity of both problems with combinations of the parameters $K$ and $r$, where $r$ is the length of the candidate strings to be selected. Importantly, all positive results above are proven in a more general setting, where the input is an edge-labeled directed acyclic graph (DAG) that succinctly represents a set of strings of the same length. Negative results are proven in the setting where the input is given explicitly as a set of strings. The latter results are equipped with an encoding of such a set as the longest common subsequences of a specific input string set.
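The two diversity measures defined in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function names (`hamming`, `sum_diversity`, `min_diversity`, `has_diverse_subset`) and the candidate strings are hypothetical, and the decision procedure is a plain brute-force search over all $K$-subsets of an explicitly listed candidate set, which is only practical for tiny inputs.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def sum_diversity(strings) -> int:
    """Sum diversity: sum of Hamming distances over all unordered pairs."""
    return sum(hamming(x, y) for x, y in combinations(strings, 2))


def min_diversity(strings) -> int:
    """Min diversity: smallest pairwise Hamming distance (needs >= 2 strings)."""
    return min(hamming(x, y) for x, y in combinations(strings, 2))


def has_diverse_subset(candidates, k, delta, diversity) -> bool:
    """Brute-force check (illustrative assumption, not the paper's method):
    does some k-subset of candidates have diversity >= delta?"""
    return any(
        diversity(subset) >= delta for subset in combinations(candidates, k)
    )


# Hypothetical candidate set of equal-length strings (r = 3).
candidates = ["abc", "abd", "xbd"]
print(sum_diversity(candidates))                            # 1 + 2 + 1 = 4
print(min_diversity(candidates))                            # 1
print(has_diverse_subset(candidates, 2, 2, min_diversity))  # True: "abc", "xbd"
print(has_diverse_subset(candidates, 3, 2, min_diversity))  # False
```

In the paper's setting the candidates would be the longest common subsequences of the input strings, possibly represented succinctly by an edge-labeled DAG rather than listed one by one; the sketch assumes the explicit list is already available.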
