Finding Diverse Strings and Longest Common Subsequences in a Graph
Abstract: In this paper, we study for the first time the Diverse Longest Common Subsequences (LCSs) problem under Hamming distance. Given a constant number of input strings, the problem asks whether there exists a subset $\mathcal X$ of $K$ longest common subsequences whose diversity is no less than a specified threshold $\Delta$. We consider two types of diversity for a set $\mathcal X$ of strings of equal length: the Sum diversity and the Min diversity, defined as the sum and the minimum, respectively, of the pairwise Hamming distances between strings in $\mathcal X$. We analyze the computational complexity of the problems with the Sum- and Min-diversity measures, called the Max-Sum and Max-Min Diverse LCSs problems, respectively, considering both approximation algorithms and parameterized complexity. Our results are summarized as follows. When $K$ is bounded, both problems are solvable in polynomial time. In contrast, when $K$ is unbounded, both problems become NP-hard, while the Max-Sum Diverse LCSs problem admits a PTAS. Furthermore, we analyze the parameterized complexity of both problems with combinations of the parameters $K$ and $r$, where $r$ is the length of the candidate strings to be selected. Importantly, all positive results above are proven in a more general setting, where the input is an edge-labeled directed acyclic graph (DAG) that succinctly represents a set of strings of the same length. Negative results are proven in the setting where the input is given explicitly as a set of strings. The latter results come with an encoding of such a set as the set of longest common subsequences of a specific set of input strings.
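The two diversity measures and the decision question from the abstract can be illustrated with a minimal Python sketch. It assumes the candidate strings are given explicitly as a list of equal-length strings, not as an edge-labeled DAG. The function names (`hamming`, `sum_diversity`, `min_diversity`, `has_diverse_subset`) are hypothetical and chosen only for this illustration. The brute-force search simply enumerates all subsets of size $K$; it is not one of the paper's algorithms.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def sum_diversity(strings) -> int:
    """Sum diversity: sum of the pairwise Hamming distances."""
    return sum(hamming(x, y) for x, y in combinations(strings, 2))


def min_diversity(strings) -> int:
    """Min diversity: minimum pairwise Hamming distance (needs >= 2 strings)."""
    return min(hamming(x, y) for x, y in combinations(strings, 2))


def has_diverse_subset(candidates, k, delta, diversity) -> bool:
    """Brute-force decision: is there a k-subset with diversity >= delta?

    Illustration only: this enumerates every k-subset of the candidates,
    so its running time grows exponentially with k.
    """
    return any(
        diversity(subset) >= delta
        for subset in combinations(candidates, k)
    )


# Hypothetical candidate set of equal-length strings.
candidates = ["abc", "abd", "xyz"]
print(sum_diversity(candidates))
print(min_diversity(candidates))
print(has_diverse_subset(candidates, 2, 3, min_diversity))
```

Passing `sum_diversity` or `min_diversity` as the `diversity` argument switches between the Max-Sum and Max-Min variants of the decision question.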