Efficiently Estimating Mutual Information Between Attributes Across Tables (2403.15553v1)

Published 22 Mar 2024 in cs.DB

Abstract: Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external tables to join with a given input table. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap or containment, but the sheer number of tables these systems return means that many irrelevant joins must be performed, which can be computationally expensive or even infeasible in practice. We address this limitation by proposing the use of efficient mutual information (MI) estimation for finding relevant joinable tables. We introduce a new sketching method that enables efficient evaluation of relationship-discovery queries by estimating MI without materializing the joins, returning a smaller set of tables that are more likely to be relevant. We also demonstrate the effectiveness of our approach at approximating MI in extensive experiments on synthetic and real-world datasets.
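
For context, the mutual information between two discrete attributes X and Y is I(X; Y) = Σ_{x,y} p(x, y) · log( p(x, y) / (p(x) p(y)) ), which is zero when the attributes are independent and grows with their statistical dependence. The Python snippet below is a minimal, illustrative sketch of the underlying idea, not the paper's sketching method: it estimates the plug-in (empirical) MI between an attribute of an input table and an attribute of a candidate external table by evaluating only the join keys that survive a shared hash-based sample, rather than materializing the full join. All table names, key formats, and the sampling rate are hypothetical.

    # Illustrative sketch only: NOT the paper's sketching method. It shows the
    # quantity being estimated (MI between attributes linked by a join key) and a
    # naive hash-based key sample that avoids materializing the full join.
    import hashlib
    import math
    from collections import Counter

    def empirical_mi(pairs):
        """Plug-in (empirical) mutual information over observed (x, y) pairs, in nats."""
        n = len(pairs)
        joint = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        mi = 0.0
        for (x, y), c in joint.items():
            p_xy = c / n
            # p_xy * log( p_xy / (p_x * p_y) ), with p_x = px[x]/n and p_y = py[y]/n
            mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
        return mi

    def key_sample(keys, rate=0.25):
        """Hash-based (coordinated) sample of join keys: a key is kept iff its hash
        falls below the sampling threshold, so both tables keep the same keys."""
        limit = int(rate * 2**32)
        return {k for k in keys
                if int(hashlib.sha1(str(k).encode()).hexdigest(), 16) % 2**32 < limit}

    # Hypothetical single-attribute tables, keyed by a shared join key.
    table_a = {f"zip{i}": ("high" if i % 3 == 0 else "low") for i in range(1000)}
    table_b = {f"zip{i}": ("urban" if i % 3 == 0 else "rural") for i in range(1000)}

    # Estimate MI on the sampled join instead of the full join.
    shared = key_sample(table_a) & key_sample(table_b)
    est = empirical_mi([(table_a[k], table_b[k]) for k in shared])
    print(f"estimated MI ~= {est:.3f} nats")

Because both tables apply the same hash function, the samples agree on which keys are retained, so the sampled join is a consistent subset of the true join; the paper's contribution is a sketch that makes this kind of estimate accurate and cheap at data-lake scale.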

