
$LCSk$++: Practical similarity metric for long strings

Published 9 Jul 2014 in cs.DS | (1407.2407v1)

Abstract: In this paper we present $LCSk$++: a new metric for measuring the similarity of long strings, and provide an algorithm for its efficient computation. With the ever-increasing size of strings occurring in practice, e.g. large genomes of plants and animals, classic algorithms such as Longest Common Subsequence (LCS) fail due to demanding computational complexity. Recently, Benson et al. defined a similarity metric named $LCSk$. By relaxing the requirement that the $k$-length substrings should not overlap, we extend their definition into a new metric. An efficient algorithm is presented which computes $LCSk$++ with complexity of $O((|X|+|Y|)\log(|X|+|Y|))$ for strings $X$ and $Y$ under a realistic random model. The algorithm has been designed with implementation simplicity in mind. Additionally, we describe how it can be adjusted to compute $LCSk$ as well, which gives an improvement of the $O(|X|\cdot|Y|)$ algorithm presented in the original $LCSk$ paper.
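To make the metric concrete, here is a minimal Python sketch of a reference computation. It assumes the following reading of the definition, which the abstract does not spell out: $LCSk$++ is the largest total length of common substrings, each of length at least $k$, that appear in the same order and without overlap in both $X$ and $Y$. This is a plain dynamic program and not the $O((|X|+|Y|)\log(|X|+|Y|))$ algorithm from the paper; the function name `lcsk_plus_plus` is hypothetical.

```python
def lcsk_plus_plus(x: str, y: str, k: int) -> int:
    """Reference DP for LCSk++ under the assumed definition above.

    Not the paper's algorithm: this runs in O(|X| * |Y| * L) time, where L is
    the longest common substring length, and is meant for small inputs only.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n, m = len(x), len(y)
    # suffix[i][j]: length of the longest common suffix of x[:i] and y[:j]
    suffix = [[0] * (m + 1) for _ in range(n + 1)]
    # best[i][j]: best score achievable using only x[:i] and y[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                suffix[i][j] = suffix[i - 1][j - 1] + 1
            # Option 1: skip the last character of x or of y.
            score = max(best[i - 1][j], best[i][j - 1])
            # Option 2: end a matched block of length >= k exactly at (i, j).
            for length in range(k, suffix[i][j] + 1):
                score = max(score, best[i - length][j - length] + length)
            best[i][j] = score
    return best[n][m]


print(lcsk_plus_plus("ABCDE", "ABCDE", 2))  # 5: one block covering everything
print(lcsk_plus_plus("ABXCD", "ABYCD", 2))  # 4: blocks "AB" and "CD"
print(lcsk_plus_plus("ABXCD", "ABYCD", 3))  # 0: no common substring of length 3
```

With $k = 1$ the recurrence reduces to the classic LCS, which is a convenient sanity check for any faster implementation.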

Citations (8)
