
Optimal quantile estimation: beyond the comparison model (2404.03847v1)

Published 5 Apr 2024 in cs.DS

Abstract: Estimating quantiles is one of the foundational problems of data sketching. Given $n$ elements $x_1, x_2, \dots, x_n$ from some universe of size $U$ arriving in a data stream, a quantile sketch estimates the rank of any element with additive error at most $\varepsilon n$. A low-space algorithm solving this task has applications in database systems, network measurement, load balancing, and many other practical scenarios. Current quantile estimation algorithms described as optimal include the GK sketch (Greenwald and Khanna 2001) using $O(\varepsilon^{-1} \log n)$ words (deterministic) and the KLL sketch (Karnin, Lang, and Liberty 2016) using $O(\varepsilon^{-1} \log\log(1/\delta))$ words (randomized, with failure probability $\delta$). However, both algorithms are only optimal in the comparison-based model, whereas most typical applications involve streams of integers that the sketch can use aside from making comparisons. If we go beyond the comparison-based model, the deterministic q-digest sketch (Shrivastava, Buragohain, Agrawal, and Suri 2004) achieves a space complexity of $O(\varepsilon^{-1}\log U)$ words, which is incomparable to the previously-mentioned sketches. It has long been asked whether there is a quantile sketch using $O(\varepsilon^{-1})$ words of space (which is optimal as long as $n \leq \mathrm{poly}(U)$). In this work, we present a deterministic algorithm using $O(\varepsilon^{-1})$ words, resolving this line of work.
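To make the guarantee in the abstract concrete, the following is a minimal sketch of what "rank with additive error at most $\varepsilon n$" means. It is an offline toy, not a streaming algorithm and not the paper's method: it sorts the whole input and keeps every `step`-th element. The names `exact_rank`, `build_summary`, and `estimate_rank` are hypothetical, and rank is assumed here to mean the number of elements less than or equal to the query.

```python
import bisect


def exact_rank(sorted_items, x):
    """Number of elements <= x in an already sorted list."""
    return bisect.bisect_right(sorted_items, x)


def build_summary(stream, eps):
    """Offline toy summary: keep the elements of rank step, 2*step, ...

    Assumption for illustration only: the whole stream fits in memory
    and may be sorted, which a real sketch cannot afford.
    """
    items = sorted(stream)
    step = max(1, int(eps * len(items)))
    kept = items[step - 1::step]  # about 1/eps values
    return kept, step


def estimate_rank(summary, x):
    """Each kept value stands for `step` elements of the sorted input."""
    kept, step = summary
    return step * bisect.bisect_right(kept, x)
```

Because the kept values sit at ranks that are exact multiples of `step`, the estimate equals the true rank rounded down to a multiple of `step`, so it underestimates by less than `step`, which is at most $\varepsilon n$ whenever $\varepsilon n \geq 1$. The summary holds about $\varepsilon^{-1}$ values, which is the target size a streaming sketch tries to approach without ever seeing the sorted input.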

References (27)
  1. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):1–28, 2013.
  2. On the amortized complexity of approximate counting. arXiv preprint arXiv:2211.03917, 2022.
  3. Generalizing Greenwald-Khanna streaming quantile summaries for weighted inputs. arXiv preprint arXiv:2303.06288, 2023.
  4. Approximate counts and quantiles over sliding windows. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 286–296, 2004.
  5. Quantiles Sketch Overview - Apache DataSketches. https://datasketches.apache.org/docs/Quantiles/QuantilesSketchOverview.html. Accessed: 2024-03-26.
  6. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In Very Large Data Bases Conference, 1997.
  7. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pages 1383–1394, 2015.
  8. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
  9. Relative error streaming quantiles. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 96–108, 2021.
  10. Space-and time-efficient deterministic algorithms for biased quantiles over data streams. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 263–272, 2006.
  11. Theory meets practice at the median: A worst case comparison of relative error quantile algorithms. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2722–2731, 2021.
  12. A tight lower bound for comparison-based quantile summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 81–93, 2020.
  13. Computing extremely accurate quantiles using t-digests. arXiv preprint arXiv:1902.04023, 2019.
  14. A randomized online quantile summary in $O((1/\varepsilon)\log(1/\varepsilon))$ words. Theory of Computing, 13(1):1–17, 2017.
  15. Moment-based quantile sketches for efficient high cardinality aggregation queries. Proceedings of the VLDB Endowment, 11(11), 2018.
  16. Space-efficient online computation of quantile summaries. ACM SIGMOD Record, 30(2):58–66, 2001.
  17. Quantiles and equi-depth histograms over streams. In Data Stream Management: Processing High-Speed Data Streams, pages 45–86. Springer, 2016.
  18. Approximate aggregate functions - GoogleSQL. https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions. Accessed: 2024-03-26.
  19. Counting inversions in lists. In SODA, volume 3, pages 253–254, 2003.
  20. https://hotlix.com/product/scorpion-suckers/, Nov 2022.
  21. Optimal quantile approximation in streams. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 71–78. IEEE, 2016.
  22. Selection and sorting with limited storage. Theoretical computer science, 12(3):315–323, 1980.
  23. Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Record, 27(2):426–435, 1998.
  24. Ddsketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. arXiv preprint arXiv:1908.10693, 2019.
  25. Andrzej Pelc. Searching games with errors—fifty years of coping with liars. Theoretical Computer Science, 270(1-2):71–109, 2002.
  26. Medians and beyond: new aggregation techniques for sensor networks. In Proceedings of the 2nd international conference on Embedded networked sensor systems, pages 239–249, 2004.
  27. Qi Zhang and Wei Wang. An efficient algorithm for approximate biased quantile computation in data streams. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 1023–1026, 2007.

Summary

  • The paper introduces a deterministic quantile sketch that goes beyond the comparison-based model and achieves optimal space complexity.
  • It uses a forest-based, tree-like data structure combined with approximate counting to reduce space overhead in data stream processing.
  • Because its guarantee is deterministic, the approach stays reliable on adversarial streams and could benefit practical systems such as Apache DataSketches.

Optimal Quantile Estimation: Beyond the Comparison Model

This paper presents a new quantile estimation algorithm that goes beyond the comparison-based model to achieve optimal space complexity for additive-error rank approximation. The problem is quantile sketching: summarizing a stream in little space so that the rank of any element can be estimated afterwards. The existing GK and KLL sketches are described as optimal, but only within the comparison-based model, in which a sketch may compare stream elements but may not otherwise use their values as integers drawn from a known universe.
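The space bounds quoted in the abstract can be compared numerically. The snippet below is only a rough illustration: it evaluates the asymptotic expressions with all hidden constants set to 1 and logarithms taken base 2, both of which are assumptions made here, so only the relative scaling is meaningful. The function name `word_bounds` is hypothetical.

```python
import math


def word_bounds(eps, n, universe_size, delta):
    """Evaluate the abstract's asymptotic bounds with constants ignored."""
    inv_eps = 1.0 / eps
    return {
        # Deterministic, comparison-based.
        "GK": inv_eps * math.log2(n),
        # Randomized, comparison-based, failure probability delta.
        "KLL": inv_eps * math.log2(math.log2(1.0 / delta)),
        # Deterministic, uses the integer universe.
        "q-digest": inv_eps * math.log2(universe_size),
        # Deterministic bound achieved in the paper.
        "this paper": inv_eps,
    }


bounds = word_bounds(eps=0.01, n=2**30, universe_size=2**32, delta=2.0**-32)
```

With these example parameters the expressions evaluate to 3000 (GK), 500 (KLL), 3200 (q-digest), and 100 (this paper). Which of GK and q-digest is smaller depends on whether $n$ or $U$ is larger, which is what the abstract means by calling them incomparable.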

The proposed algorithm is deterministic and uses $O(\varepsilon^{-1})$ words of space, which it achieves by modifying the tree-like data structure used to represent the stream. Using a forest instead of a single tree in the recursive layer saves substantial space, especially at lower layers. On top of this structure, the authors give a concise encoding of the non-empty nodes which, combined with approximate counting of elements, removes unnecessary overhead.

The theoretical significance lies in the model considered: typical data streams consist of integers, and the universe may be much smaller than the stream itself. In this setting a sketch can exploit the integer structure of the elements and improve on purely comparison-based sketches. By achieving $O(\varepsilon^{-1})$ words, the authors resolve a longstanding open question about the optimal space of quantile sketches outside the comparison-based model.
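As an illustration of how a sketch can use the integer universe rather than comparisons alone, here is a simplified version of the older q-digest idea mentioned in the abstract. This is not the paper's algorithm: it has no forest and no approximate counting, and it targets the weaker $O(\varepsilon^{-1}\log U)$ bound. The class name `QDigest`, the heap-style node numbering, and the choice to compress every `k` insertions are assumptions made for this sketch.

```python
import math


class QDigest:
    """Simplified q-digest over integers in [0, U), U a power of two."""

    def __init__(self, universe_size, eps):
        assert universe_size >= 2 and universe_size & (universe_size - 1) == 0
        self.U = universe_size
        self.height = universe_size.bit_length() - 1  # log2(U)
        # With k >= log2(U) / eps the rank error is at most eps * n.
        self.k = max(1, math.ceil(self.height / eps))
        # Binary tree over the universe: root is node 1, children of v
        # are 2v and 2v + 1, and the leaf for value x is node U + x.
        self.counts = {}
        self.n = 0

    def insert(self, x):
        leaf = self.U + x
        self.counts[leaf] = self.counts.get(leaf, 0) + 1
        self.n += 1
        if self.n % self.k == 0:
            self._compress()

    def _compress(self):
        """Fold light (node, sibling, parent) triples into the parent."""
        limit = self.n // self.k
        for depth in range(self.height, 0, -1):  # deepest level first
            level = [v for v in self.counts if v.bit_length() - 1 == depth]
            for v in level:
                if v not in self.counts:
                    continue  # already folded together with its sibling
                parent, sibling = v >> 1, v ^ 1
                total = (self.counts.get(v, 0)
                         + self.counts.get(sibling, 0)
                         + self.counts.get(parent, 0))
                if total <= limit:
                    self.counts[parent] = total
                    self.counts.pop(v, None)
                    self.counts.pop(sibling, None)

    def rank(self, x):
        """Lower bound on the number of inserted values <= x."""
        total = 0
        for v, c in self.counts.items():
            depth = v.bit_length() - 1
            span = self.height - depth  # node covers 2**span values
            hi = ((v - (1 << depth) + 1) << span) - 1  # largest value covered
            if hi <= x:
                total += c
        return total
```

Every internal node holds at most $n/k$ elements, because a fold only happens when the combined count is within that limit. A rank query can only miss elements stored in the $\log_2 U$ internal ancestors of the queried leaf, so the estimate undercounts by at most $\log_2 U \cdot n/k \leq \varepsilon n$. The $\log U$ factor in the node budget `k` is the overhead that the paper's algorithm removes.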

Because the algorithm is deterministic, its error guarantee holds for every input, including adversarially chosen streams, whereas a randomized sketch such as KLL fails with some probability $\delta$. This makes it well suited to applications that need consistent accuracy and performance, such as large database management systems and network traffic analysis platforms.

From a practical standpoint, this work may enable more space-efficient quantile estimation in systems such as Apache DataSketches or GoogleSQL. Possible benefits include cheaper monitoring of operational metrics and network load, and the ability to embed quantile queries directly in resource-limited systems, which could widen access to such statistical computations.

The dependence on the universe of the data is particularly interesting and suggests exploring how this work could be adapted to, or integrated with, existing frameworks that manage diverse multi-source data streams. Additionally, the partially mergeable nature of the algorithm opens a further avenue for research into its applicability to distributed systems and big data platforms, where scalability and resource optimization are paramount.

In conclusion, the paper advances our understanding of quantile estimation beyond comparison-based models and may set a new baseline for the field through its structural design and deterministic guarantees. The potential for further development and integration suggests it could serve as a foundation for future research on statistical estimation in dynamic and resource-constrained environments.