Finding Subcube Heavy Hitters in Analytics Data Streams (1708.05159v2)

Published 17 Aug 2017 in cs.DS

Abstract: Data streams typically have items with a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{T_1,\ldots,T_k\} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates $T$ have joint values $v$. The all subcube heavy hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one-dimensional version of this problem, where $d=1$, has been heavily studied in data stream theory, databases, networking and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir-sampling-based one-pass streaming algorithm to solve the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored.
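
The one-pass sampling idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm as specified: the class name `SubcubeSampler`, the `sample_size` parameter, the 0-indexed coordinates, and the decision threshold $\gamma/2$ (chosen only because it lies between $\gamma/4$ and $\gamma$) are all assumptions made for illustration. The sketch keeps a uniform reservoir sample of the stream and answers ${\rm Query}(T,v)$ and ${\rm AllQuery}(T)$ from the empirical frequencies in that sample.

```python
import random
from collections import Counter


class SubcubeSampler:
    """Hypothetical helper: uniform reservoir sample of d-dimensional items."""

    def __init__(self, sample_size, seed=None):
        self.sample_size = sample_size  # assumed to be chosen by the caller
        self.sample = []                # current reservoir
        self.seen = 0                   # number of stream items processed
        self.rng = random.Random(seed)

    def update(self, item):
        """Process one stream item (classic reservoir sampling, Algorithm R)."""
        self.seen += 1
        if len(self.sample) < self.sample_size:
            self.sample.append(tuple(item))
        else:
            # Keep the new item with probability sample_size / seen.
            j = self.rng.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = tuple(item)

    def estimate(self, T, v):
        """Empirical f_T(v): fraction of sampled items matching v on coordinates T."""
        if not self.sample:
            return 0.0
        v = tuple(v)
        hits = sum(1 for x in self.sample if tuple(x[i] for i in T) == v)
        return hits / len(self.sample)

    def query(self, T, v, gamma):
        """Answer YES (True) when the estimate reaches gamma / 2 (assumed threshold)."""
        return self.estimate(T, v) >= gamma / 2

    def all_query(self, T, gamma):
        """Return all joint values on T whose estimate reaches gamma / 2."""
        counts = Counter(tuple(x[i] for i in T) for x in self.sample)
        threshold = gamma / 2 * len(self.sample)
        return sorted(v for v, c in counts.items() if c >= threshold)


# Usage: a toy stream of 3-dimensional items.
sampler = SubcubeSampler(sample_size=1000, seed=0)
for _ in range(90):
    sampler.update((1, 2, 3))
for _ in range(10):
    sampler.update((1, 5, 3))

heavy = sampler.all_query(T=(1,), gamma=0.5)   # joint values on coordinate 1
is_heavy = sampler.query(T=(0, 2), v=(1, 3), gamma=0.5)
```

Because the toy stream is shorter than the reservoir, the sample here is the whole stream and the estimates are exact. On a longer stream the estimates are only approximate, and how large `sample_size` must be for the YES/NO guarantee to hold is left to the paper's analysis.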
