Expected Density of Random Minimizers (2410.16968v2)

Published 22 Oct 2024 in math.CO and q-bio.GN

Abstract: Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size $\sigma$, a minimizer is defined by two integers $k,w\ge2$ and a total order $\rho$ on strings of length $k$ (also called $k$-mers). A string is processed by a sliding window algorithm that chooses, in each window of length $w+k-1$, its minimal $k$-mer with respect to $\rho$. A key characteristic of the minimizer is the expected density of chosen $k$-mers among all $k$-mers in a random infinite $\sigma$-ary string. Random minimizers, in which the order $\rho$ is chosen uniformly at random, are often used in applications. However, little is known about their expected density $\mathcal{DR}\sigma(k,w)$ besides the fact that it is close to $\frac{2}{w+1}$ unless $w\gg k$. We first show that $\mathcal{DR}\sigma(k,w)$ can be computed in $O(k\sigma^{k+w})$ time. Then we attend to the case $w\le k$ and present a formula that allows one to compute $\mathcal{DR}\sigma(k,w)$ in just $O(w \log w)$ time. Further, we describe the behaviour of $\mathcal{DR}\sigma(k,w)$ in this case, establishing the connection between $\mathcal{DR}\sigma(k,w)$, $\mathcal{DR}\sigma(k+1,w)$, and $\mathcal{DR}\sigma(k,w+1)$. In particular, we show that $\mathcal{DR}\sigma(k,w)<\frac{2}{w+1}$ (by a tiny margin) unless $w$ is small. We conclude with some partial results and conjectures for the case $w>k$.