Optimal quantile estimation: beyond the comparison model (2404.03847v1)
Abstract: Estimating quantiles is one of the foundational problems of data sketching. Given $n$ elements $x_1, x_2, \dots, x_n$ from some universe of size $U$ arriving in a data stream, a quantile sketch estimates the rank of any element with additive error at most $\varepsilon n$. A low-space algorithm solving this task has applications in database systems, network measurement, load balancing, and many other practical scenarios. Current quantile estimation algorithms described as optimal include the GK sketch (Greenwald and Khanna 2001) using $O(\varepsilon^{-1} \log n)$ words (deterministic) and the KLL sketch (Karnin, Lang, and Liberty 2016) using $O(\varepsilon^{-1} \log\log(1/\delta))$ words (randomized, with failure probability $\delta$). However, both algorithms are only optimal in the comparison-based model, whereas most typical applications involve streams of integers that the sketch can use aside from making comparisons. If we go beyond the comparison-based model, the deterministic q-digest sketch (Shrivastava, Buragohain, Agrawal, and Suri 2004) achieves a space complexity of $O(\varepsilon^{-1}\log U)$ words, which is incomparable to the previously-mentioned sketches. It has long been asked whether there is a quantile sketch using $O(\varepsilon^{-1})$ words of space (which is optimal as long as $n \leq \mathrm{poly}(U)$). In this work, we present a deterministic algorithm using $O(\varepsilon^{-1})$ words, resolving this line of work.
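To make the guarantee in the abstract concrete, the following is a minimal Python sketch of the rank-estimation task itself, not of the algorithm presented in this work or of GK, KLL, or q-digest. It is an offline toy: it sorts the whole input (so it uses $O(n)$ working space) and keeps every $k$-th element with $k = \max(1, \lfloor \varepsilon n \rfloor)$, which leaves roughly $\varepsilon^{-1}$ stored items and an additive rank error below $\varepsilon n$. The function names `build_summary`, `estimate_rank`, and `true_rank` are hypothetical and chosen only for illustration.

```python
import bisect
import math


def build_summary(stream, eps):
    """Offline toy summary (assumption: the whole input fits in memory).

    Keeps every k-th element of the sorted input, k = max(1, floor(eps * n)),
    so about 1/eps elements are stored.
    """
    data = sorted(stream)
    n = len(data)
    k = max(1, math.floor(eps * n))
    samples = data[k - 1::k]  # elements at sorted positions k, 2k, 3k, ...
    return samples, k, n


def estimate_rank(summary, q):
    """Estimate the number of input elements <= q.

    Each stored sample stands for k input elements, so the estimate is
    k times the number of samples <= q. It undercounts by less than k.
    """
    samples, k, _ = summary
    return k * bisect.bisect_right(samples, q)


def true_rank(stream, q):
    """Exact number of input elements <= q, used only for checking."""
    return sum(1 for x in stream if x <= q)


if __name__ == "__main__":
    stream = [(i * 37) % 101 for i in range(1000)]  # integers from a small universe
    eps = 0.05
    summary = build_summary(stream, eps)
    worst = max(
        abs(estimate_rank(summary, q) - true_rank(stream, q))
        for q in range(-1, 102)
    )
    print("stored items:", len(summary[0]))
    print("worst rank error:", worst, "allowed:", eps * len(stream))
```

The point of the example is only the shape of the guarantee: about $\varepsilon^{-1}$ stored items suffice once the data is sorted. The difficulty the abstract addresses is matching that space bound deterministically in a single pass over a stream.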
- Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):1–28, 2013.
- On the amortized complexity of approximate counting. arXiv preprint arXiv:2211.03917, 2022.
- Generalizing Greenwald-Khanna streaming quantile summaries for weighted inputs. arXiv preprint arXiv:2303.06288, 2023.
- Approximate counts and quantiles over sliding windows. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 286–296, 2004.
- Quantiles Sketch Overview - Apache DataSketches. https://datasketches.apache.org/docs/Quantiles/QuantilesSketchOverview.html. Accessed: 2024-03-26.
- A one-pass algorithm for accurately estimating quantiles for disk-resident data. In Very Large Data Bases Conference, 1997.
- Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1383–1394, 2015.
- XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
- Relative error streaming quantiles. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 96–108, 2021.
- Space- and time-efficient deterministic algorithms for biased quantiles over data streams. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 263–272, 2006.
- Theory meets practice at the median: A worst case comparison of relative error quantile algorithms. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2722–2731, 2021.
- A tight lower bound for comparison-based quantile summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 81–93, 2020.
- Computing extremely accurate quantiles using t-digests. arXiv preprint arXiv:1902.04023, 2019.
- A randomized online quantile summary in $O((1/\varepsilon)\log(1/\varepsilon))$ words. Theory of Computing, 13(1):1–17, 2017.
- Moment-based quantile sketches for efficient high cardinality aggregation queries. Proceedings of the VLDB Endowment, 11(11), 2018.
- Space-efficient online computation of quantile summaries. ACM SIGMOD Record, 30(2):58–66, 2001.
- Quantiles and equi-depth histograms over streams. In Data Stream Management: Processing High-Speed Data Streams, pages 45–86. Springer, 2016.
- Approximate aggregate functions - GoogleSQL. https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions. Accessed: 2024-03-26.
- Counting inversions in lists. In SODA, volume 3, pages 253–254, 2003.
- https://hotlix.com/product/scorpion-suckers/, Nov 2022.
- Optimal quantile approximation in streams. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 71–78. IEEE, 2016.
- Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.
- Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Record, 27(2):426–435, 1998.
- DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. arXiv preprint arXiv:1908.10693, 2019.
- Andrzej Pelc. Searching games with errors—fifty years of coping with liars. Theoretical Computer Science, 270(1-2):71–109, 2002.
- Medians and beyond: new aggregation techniques for sensor networks. In Proceedings of the 2nd international conference on Embedded networked sensor systems, pages 239–249, 2004.
- Qi Zhang and Wei Wang. An efficient algorithm for approximate biased quantile computation in data streams. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 1023–1026, 2007.