KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science (2303.02204v4)
Abstract: In recent years, we have witnessed growing interest from academia and industry in applying data science technologies to analyze large amounts of data. In this process, a myriad of artifacts (datasets, pipeline scripts, etc.) are created. However, there has been no systematic attempt to holistically collect and exploit all the knowledge and experience implicitly contained in those artifacts. Instead, data scientists recover information and expertise from colleagues or learn via trial and error. Hence, this paper presents a scalable platform, KGLiDS, that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables various downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML. It shows that KGLiDS is significantly faster and has a lower memory footprint than state-of-the-art systems while achieving comparable or better accuracy.