Distinctiveness Maximization in Datasets Assemblage (2401.00659v4)

Published 1 Jan 2024 in cs.DB

Abstract: In this paper, given a user's query set and budget, we aim to use the limited budget to help users assemble a set of datasets that can enrich a base dataset by introducing the maximum number of distinct tuples (i.e., maximizing distinctiveness). We prove this problem to be NP-hard. A greedy algorithm using exact distinctiveness computation attains an approximation ratio of (1-1/e)/2, but it lacks efficiency and scalability due to its frequent computation of the exact distinctiveness marginal gain of any candidate dataset for selection. This requires scanning through every tuple in candidate datasets and thus is unaffordable in practice. To overcome this limitation, we propose an efficient ML-based method for estimating the distinctiveness marginal gain of any candidate dataset. This effectively eliminates the need to test each tuple individually. Estimating the distinctiveness marginal gain of a dataset involves estimating the number of distinct tuples in the tuple sets returned by each query in a query set across multiple datasets. This can be viewed as the cardinality estimation for a query set on a set of datasets, and the proposed method is the first to tackle this cardinality estimation problem. This is a significant advancement over prior methods that were limited to single-query cardinality estimation on a single dataset and struggled with identifying overlaps among tuple sets returned by each query in a query set across multiple datasets. Extensive experiments using five real-world data pools demonstrate that our algorithm, which utilizes ML-based distinctiveness estimation, outperforms all relevant baselines in effectiveness, efficiency, and scalability. A case study on two downstream ML tasks also highlights its potential to find datasets with more useful tuples to enhance the performance of ML tasks.
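
The exact greedy baseline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a generic budgeted greedy in the style of budgeted maximum coverage, written under several assumptions: each candidate dataset is already reduced to the set of tuples the query set returns from it, every dataset has a positive price, and the names `greedy_assemble`, `candidates`, `prices`, and `budget` are hypothetical. It repeatedly picks the affordable dataset with the best ratio of new distinct tuples to price, then compares the result against the best single affordable dataset. Computing each marginal gain scans the candidate's tuples, which is the per-tuple cost the abstract says the ML-based estimator is meant to avoid.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def greedy_assemble(
    base: Iterable[Hashable],
    candidates: Dict[str, Set[Hashable]],
    prices: Dict[str, float],
    budget: float,
) -> Tuple[List[str], int]:
    """Return (chosen dataset names, number of distinct tuples added to base).

    Assumes every price is positive and every candidate is a set of
    hashable tuples (hypothetical input format, for illustration only).
    """
    base_set = set(base)
    covered = set(base_set)
    pool = dict(candidates)
    chosen: List[str] = []
    remaining = budget

    # Ratio greedy: exact marginal gain per unit of price.
    while pool:
        best_name, best_ratio = None, 0.0
        for name, tuples in pool.items():
            if prices[name] > remaining:
                continue
            gain = len(tuples - covered)  # exact scan of the candidate
            ratio = gain / prices[name]
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:  # nothing affordable adds a new tuple
            break
        chosen.append(best_name)
        covered |= pool.pop(best_name)
        remaining -= prices[best_name]
    greedy_gain = len(covered) - len(base_set)

    # Safeguard: compare with the best single affordable dataset.
    single_name, single_gain = None, 0
    for name, tuples in candidates.items():
        if prices[name] <= budget:
            gain = len(tuples - base_set)
            if gain > single_gain:
                single_name, single_gain = name, gain

    if single_name is not None and single_gain > greedy_gain:
        return [single_name], single_gain
    return chosen, greedy_gain
```

For example, with a cheap dataset that adds one tuple at price 1 and an expensive one that adds nine tuples at price 10, a budget of 10 makes the ratio step pick the cheap dataset first and then run out of budget; the single-dataset check recovers the better choice.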

