
Range (Rényi) Entropy Queries and Partitioning (2312.15959v2)

Published 26 Dec 2023 in cs.DS and cs.DB

Abstract: Data partitioning that maximizes/minimizes the Shannon entropy, or more generally the Rényi entropy, is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be accelerated if we have a data structure to compute the entropy in different subsets of data when the algorithm needs to decide what block to construct. Such a data structure will also be useful for data analysts exploring different subsets of data to identify areas of interest. While it is generally known how to compute the Shannon or the Rényi entropy of a discrete distribution in the offline or streaming setting efficiently, we focus on the query setting where we aim to efficiently derive the entropy among a subset of data that satisfy some linear predicates. We solve this problem in a typical setting when we deal with real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low space data structure, such that given a query (hyper)rectangle $R$, it computes the Shannon (resp. Rényi) entropy based on the colors and the weights of the points in $P\cap R$, in sublinear time. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for both the range S-entropy and R-entropy query problems. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near linear space data structures for returning either an additive or a multiplicative approximation of the Shannon (resp. Rényi) entropy in $P\cap R$.
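
To make the query concrete, here is a minimal brute-force sketch in Python. It is not one of the paper's data structures: it scans all $n$ points per query, which is exactly the linear-time baseline the paper aims to beat. It assumes the standard definitions, where each color's probability is its share of the total weight in $P\cap R$, the Shannon entropy is $\sum_c p_c \log(1/p_c)$, and the Rényi entropy of order $\alpha$ is $\frac{1}{1-\alpha}\log\sum_c p_c^{\alpha}$. The base-2 logarithm, the closed rectangle boundaries, and all function names are illustrative assumptions.

```python
import math
from collections import defaultdict


def color_distribution(points, rect):
    """Return {color: probability} for the points inside the query rectangle.

    points: iterable of (coords, color, weight), coords being a tuple of length d.
    rect:   list of (lo, hi) pairs, one per dimension (closed intervals, assumed).
    """
    totals = defaultdict(float)
    for coords, color, weight in points:
        if all(lo <= x <= hi for x, (lo, hi) in zip(coords, rect)):
            totals[color] += weight
    total = sum(totals.values())
    if total == 0:
        return {}  # no weight falls inside the rectangle
    return {color: w / total for color, w in totals.items()}


def shannon_entropy(dist):
    """Shannon entropy in bits of a {color: probability} distribution."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)


def renyi_entropy(dist, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1) in bits."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    if not dist:
        return 0.0  # convention chosen here for an empty range
    return math.log2(sum(p ** alpha for p in dist.values())) / (1 - alpha)


# Hypothetical example data: three weighted, colored points in the plane.
points = [((0, 0), "red", 1.0), ((1, 1), "blue", 1.0), ((5, 5), "red", 2.0)]
query = [(0, 2), (0, 2)]  # the rectangle [0, 2] x [0, 2]
dist = color_distribution(points, query)
print(shannon_entropy(dist), renyi_entropy(dist, alpha=2))
```

In this example the rectangle contains one red and one blue point of equal weight, so both entropies equal one bit. Each query costs $O(nd)$ time, which is why the abstract asks for structures that answer in sublinear time.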

