Efficiently Estimating Mutual Information Between Attributes Across Tables (2403.15553v1)
Abstract: Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, efficiently discovering relevant external tables to join with a given input table is challenging. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap or containment. These systems often return so many candidate tables that many irrelevant joins must be performed, which can be computationally expensive or even infeasible in practice. We address this limitation by proposing the use of efficient mutual information (MI) estimation for finding relevant joinable tables. We introduce a new sketching method that enables efficient evaluation of relationship discovery queries by estimating MI without materializing the joins, returning a smaller set of tables that are more likely to be relevant. We also demonstrate the effectiveness of our approach at approximating MI in extensive experiments using synthetic and real-world datasets.
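To make the idea of estimating MI without materializing a join more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's sketching method: it assumes join keys are unique within each table, keeps the rows whose keys have the smallest hash values (a simple form of coordinated sampling), and applies a plug-in MI estimator for discrete values to the small join of the two sketches. All function names (`key_hash`, `build_sketch`, `plugin_mi`, `estimate_mi`) are hypothetical.

```python
import hashlib
import math
from collections import Counter


def key_hash(key):
    """Map a join key to a pseudo-random number in [0, 1), identically for every table."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(rows, size):
    """Keep the (key, value) pairs whose keys have the `size` smallest hashes.

    Assumption for illustration: keys are unique within a table.
    """
    return dict(sorted(rows, key=lambda kv: key_hash(kv[0]))[:size])


def plugin_mi(pairs):
    """Plug-in (maximum-likelihood) MI estimate, in nats, for discrete (x, y) pairs."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # Sum over observed cells: p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def estimate_mi(sketch_a, sketch_b):
    """Join the two sketches on their shared keys and estimate MI on that sample."""
    shared = sketch_a.keys() & sketch_b.keys()
    return plugin_mi([(sketch_a[k], sketch_b[k]) for k in shared])


# Usage: two tables with the same keys and an identical binary attribute.
table_a = [(k, k % 2) for k in range(1000)]
table_b = [(k, k % 2) for k in range(1000)]
mi = estimate_mi(build_sketch(table_a, 100), build_sketch(table_b, 100))
```

Because both sketches select keys with the same hash function, keys present in both tables tend to be sampled together, so the sketch join is far larger than the join of two independent random samples would be. The estimate is computed from at most `size` rows per table, never from the full join.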