
KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science (2303.02204v4)

Published 3 Mar 2023 in cs.LG

Abstract: In recent years, academia and industry have shown growing interest in applying data science technologies to analyze large amounts of data. This process creates a myriad of artifacts (datasets, pipeline scripts, etc.). However, there has been no systematic attempt to holistically collect and exploit the knowledge and experience implicitly contained in those artifacts. Instead, data scientists recover information and expertise from colleagues or learn via trial and error. This paper therefore presents KGLiDS, a scalable platform that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables various downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML. It shows that KGLiDS is significantly faster and has a lower memory footprint than state-of-the-art systems while achieving comparable or better accuracy.
