
Holistic Cube Analysis: A Query Framework for Data Insights (2302.00120v4)

Published 31 Jan 2023 in cs.DB, cs.DC, and cs.PL

Abstract: Many data insight questions can be viewed as searching a large space of tables for the important ones, where the notion of importance is defined in some ad hoc, user-defined manner. This paper presents Holistic Cube Analysis (HoCA), a framework that augments the capabilities of relational queries for such problems. HoCA first augments the relational data model and introduces a new data type, AbstractCube, defined as a function which maps a region-features pair to a relational table (a region is a tuple which specifies values of a set of dimensions). AbstractCube provides a logical form of data, and HoCA operators are cube-to-cube transformations. We describe two basic but fundamental HoCA operators, cube crawling and cube join (with many possible extensions). Cube crawling explores a region space and outputs a cube that maps regions to signal vectors. Cube join, in turn, is critical for composition, allowing one to join information from different cubes for deeper analysis. Cube crawling introduces two novel programming features: (programmable) Region Analysis Models (RAMs) and Multi-Model Crawling. Crucially, a RAM has a notion of population features, which allows one to go beyond analyzing only the local features of a region and to program region-population analyses that compare region and population features, capturing a large class of importance notions. HoCA has a rich algorithmic space, including optimizing crawling and join performance and the physical design of cubes. We have implemented and deployed HoCA at Google. Our early HoCA offering has attracted more than 30 teams building applications with it, across a diverse spectrum of fields including system monitoring, experimentation analysis, and business intelligence. For many applications, HoCA empowers novel and powerful analyses, such as instances of recurrent crawling, which are challenging to achieve otherwise.
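
To make the abstract's vocabulary concrete, the following is a minimal Python sketch of the ideas it names: a cube as a function from a region-features pair to a table, a crawl that maps regions to signal vectors, and a region analysis model that compares a region against the population. It is not the HoCA API. The toy data, the column names, and the functions `abstract_cube`, `ram`, `crawl`, and `cube_join` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations, product

# Hypothetical toy data; the columns are illustrative, not from the paper.
ROWS = [
    {"country": "US", "device": "mobile", "errors": 30, "requests": 100},
    {"country": "US", "device": "desktop", "errors": 5, "requests": 100},
    {"country": "DE", "device": "mobile", "errors": 4, "requests": 100},
    {"country": "DE", "device": "desktop", "errors": 1, "requests": 100},
]
DIMENSIONS = ("country", "device")


def abstract_cube(region, features):
    """Map a (region, features) pair to a table, here a list of row dicts.

    `region` fixes values for some dimensions; the empty region is the
    whole population. `features` names the columns to return.
    """
    return [
        {f: row[f] for f in features}
        for row in ROWS
        if all(row[d] == v for d, v in region.items())
    ]


def error_rate(table):
    requests = sum(r["requests"] for r in table)
    return sum(r["errors"] for r in table) / requests if requests else 0.0


def ram(region_table, population_table):
    """Toy region analysis model: compare a region with the population."""
    rate = error_rate(region_table)
    base = error_rate(population_table)
    return {"rate": rate, "lift": rate / base if base else 0.0}


def crawl(cube, dimensions, model, features=("errors", "requests")):
    """Visit every non-empty region and map it to a signal vector."""
    population = cube({}, features)
    all_rows = cube({}, dimensions)
    signals = {}
    for k in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            values = [sorted({r[d] for r in all_rows}) for d in dims]
            for combo in product(*values):
                region = dict(zip(dims, combo))
                table = cube(region, features)
                if table:
                    key = tuple(sorted(region.items()))
                    signals[key] = model(table, population)
    return signals


def cube_join(left, right):
    """Join two region-to-signals cubes on the regions they share."""
    return {r: {**left[r], **right[r]} for r in left.keys() & right.keys()}


signals = crawl(abstract_cube, DIMENSIONS, ram)
top_region = max(signals, key=lambda r: signals[r]["lift"])
print(top_region, signals[top_region])
```

The sketch enumerates regions exhaustively, which is only workable for a toy region space; the abstract notes that optimizing crawling and join performance is part of HoCA's algorithmic space, and nothing here reflects how the deployed system does it.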
