Interactive Ontology Matching with Cost-Efficient Learning (2404.07663v1)
Abstract: The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, particularly in the context of the rising data economy. However, automatic ontology matchers are often bound to the heuristics they are based on, leaving many matches unidentified. Interactive ontology matching systems involving human experts have been introduced, but they do not solve the fundamental issue of flexibly finding additional matches outside the scope of the implemented heuristics, even though this capability is in high demand in industrial settings. Active machine learning methods appear to be a promising path towards a flexible interactive ontology matcher. However, off-the-shelf active learning mechanisms suffer from low query efficiency due to extreme class imbalance, resulting in a last-mile problem where high human effort is required to identify the remaining matches. To address the last-mile problem, this work introduces DualLoop, an active learning method tailored to ontology matching. DualLoop offers three main contributions: (1) an ensemble of tunable heuristic matchers, (2) a short-term learner with a novel query strategy adapted to highly imbalanced data, and (3) long-term learners that explore potential matches by creating and tuning new heuristics. We evaluated DualLoop on three datasets of varying sizes and domains. Compared to existing active learning methods, it consistently achieved better F1 scores and recall, reducing the expected query cost of finding 90% of all matches by over 50%. Compared to traditional interactive ontology matchers, it finds additional, last-mile matches. Finally, we detail the successful deployment of our approach within an actual product and report its operational performance in the Architecture, Engineering, and Construction (AEC) industry sector, showcasing its practical value and efficiency.
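To make the class-imbalance argument concrete, the following minimal sketch computes how rare true matches are among all candidate pairs and how many queries a purely random strategy would be expected to spend before reaching 90% recall. It is not the DualLoop method or any baseline from the paper. The ontology sizes, the match count, and the function names are hypothetical, chosen only for illustration. The expected-cost formula is the standard result for drawing without replacement: the k-th of K positives among N items appears, on average, at position k(N+1)/(K+1).

```python
def candidate_pairs(n_source: int, n_target: int) -> int:
    """Number of entity pairs a matcher could be asked about."""
    return n_source * n_target


def positive_rate(n_source: int, n_target: int, n_matches: int) -> float:
    """Fraction of candidate pairs that are true matches."""
    return n_matches / candidate_pairs(n_source, n_target)


def expected_random_queries(n_pairs: int, n_matches: int,
                            recall_percent: int = 90) -> float:
    """Expected queries for uniform random sampling (without replacement)
    to uncover `recall_percent` percent of all true matches."""
    # Integer ceiling of recall_percent% of n_matches, avoiding float rounding.
    k = (recall_percent * n_matches + 99) // 100
    return k * (n_pairs + 1) / (n_matches + 1)


# Hypothetical example: 100 x 200 entities with 50 true matches.
pairs = candidate_pairs(100, 200)
rate = positive_rate(100, 200, 50)
cost = expected_random_queries(pairs, 50)
print(f"{pairs} candidate pairs, positive rate {rate:.4%}")
print(f"expected random queries for 90% recall: {cost:.0f}")
```

In this hypothetical setting, only one in 400 candidate pairs is a match, and random querying would label most of the candidate space before reaching 90% recall. This is why the expected query cost is a natural measure for comparing query strategies.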
Authors: Bin Cheng, Jonathan Fürst, Tobias Jacobs, Celia Garrido-Hidalgo