AdaTyper: Adaptive Semantic Column Type Detection (2311.13806v1)
Abstract: Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available, which can be applied for semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a gap between this performance and its applicability in practice. In this paper, we propose AdaTyper to address one of the most critical deployment challenges: adaptation. AdaTyper uses weak supervision to adapt a hybrid type predictor towards new semantic types and shifted data distributions at inference time, using minimal human feedback. The hybrid type predictor of AdaTyper combines rule-based methods and a light machine learning model for semantic column type detection. We evaluate the adaptation performance of AdaTyper on real-world database tables hand-annotated with semantic column types through crowdsourcing and find that the F1-score improves for new and existing types. AdaTyper approaches an average precision of 0.6 after seeing only 5 examples, significantly outperforming existing adaptation methods based on human-provided regular expressions or dictionaries.
- Madelon Hulsebos
- Paul Groth
- Çağatay Demiralp