Approximating Large Frequency Moments with Pick-and-Drop Sampling (1212.0202v2)
Abstract: Given data stream $D = {p_1,p_2,...,p_m}$ of size $m$ of numbers from ${1,..., n}$, the frequency of $i$ is defined as $f_i = |{j: p_j = i}|$. The $k$-th \emph{frequency moment} of $D$ is defined as $F_k = \sum_{i=1}n f_ik$. We consider the problem of approximating frequency moments in insertion-only streams for $k\ge 3$. For any constant $c$ we show an $O(n{1-2/k}\log(n)\log{(c)}(n))$ upper bound on the space complexity of the problem. Here $\log{(c)}(n)$ is the iterative $\log$ function. To simplify the presentation, we make the following assumptions: $n$ and $m$ are polynomially far; approximation error $\epsilon$ and parameter $k$ are constants. We observe a natural bijection between streams and special matrices. Our main technical contribution is a non-uniform sampling method on matrices. We call our method a \emph{pick-and-drop sampling}; it samples a heavy element (i.e., element $i$ with frequency $\Omega(F_k)$) with probability $\Omega(1/n{1-2/k})$ and gives approximation $\tilde{f_i} \ge (1-\epsilon)f_i$. In addition, the estimations never exceed the real values, that is $ \tilde{f_j} \le f_j$ for all $j$. As a result, we reduce the space complexity of finding a heavy element to $O(n{1-2/k}\log(n))$ bits. We apply our method of recursive sketches and resolve the problem with $O(n{1-2/k}\log(n)\log{(c)}(n))$ bits.