SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines (2303.03132v2)
Abstract: The goal of entity resolution is to identify records in multiple datasets that represent the same real-world entity. However, comparing all records across datasets is computationally expensive and leads to long runtimes. To reduce these runtimes, entity resolution pipelines consist of two parts: a blocker that applies a computationally cheap method to select candidate record pairs, and a matcher that afterwards identifies the matching pairs in this set using more expensive methods. This paper presents SC-Block, a blocking method that uses supervised contrastive learning to position records in an embedding space and nearest neighbour search to build the candidate set. We benchmark SC-Block against eight state-of-the-art blocking methods. In order to relate the training time of SC-Block to the reduction of the overall runtime of the entity resolution pipeline, we combine SC-Block with four matching methods into complete pipelines. For measuring the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher. The results show that SC-Block creates smaller candidate sets, and pipelines with SC-Block execute 1.5 to 2 times faster than pipelines with other blockers, without sacrificing F1 score. Blockers are often evaluated on relatively small datasets, which can cause runtime effects resulting from large vocabulary sizes to be overlooked. To measure runtimes in a more challenging setting, we introduce a new benchmark dataset that requires a large number of product offers to be blocked. On this large-scale benchmark, pipelines that use SC-Block and the best-performing matcher execute 8 times faster than pipelines using another blocker with the same matcher, reducing the runtime from 2.5 hours to 18 minutes and clearly compensating for the 5 minutes required to train SC-Block.
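The abstract describes blocking as a two-step procedure: records are first embedded with an encoder trained via supervised contrastive learning, and candidate pairs are then selected by nearest neighbour search in the embedding space. The following is a minimal sketch of that blocking step only, not the authors' implementation: the encoder here is a generic pre-trained sentence transformer standing in for the contrastively fine-tuned model, and the model name, value of k, and FAISS index type are illustrative assumptions.

```python
# Sketch of embedding-based blocking: encode records, then build the
# candidate set via k-nearest-neighbour search (assumptions noted above).
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer


def build_candidate_set(left_records, right_records, k=5,
                        model_name="sentence-transformers/all-mpnet-base-v2"):
    """Return candidate pairs (i, j) linking left_records[i] to right_records[j]."""
    # Stand-in encoder; SC-Block would use a supervised-contrastively fine-tuned model.
    encoder = SentenceTransformer(model_name)

    # Embed both record collections; normalising makes inner product equal cosine similarity.
    left_emb = encoder.encode(left_records, convert_to_numpy=True,
                              normalize_embeddings=True).astype(np.float32)
    right_emb = encoder.encode(right_records, convert_to_numpy=True,
                               normalize_embeddings=True).astype(np.float32)

    # Index the right-hand records and query the k nearest neighbours of each left record.
    index = faiss.IndexFlatIP(right_emb.shape[1])
    index.add(right_emb)
    _, neighbours = index.search(left_emb, k)

    # The candidate set that would be passed on to the matcher.
    return [(i, int(j)) for i, row in enumerate(neighbours) for j in row]


if __name__ == "__main__":
    left = ["apple iphone 13 128gb black", "samsung galaxy s22 256gb"]
    right = ["iPhone 13 (128 GB, Black)", "Galaxy S22 5G 256GB", "Google Pixel 7"]
    print(build_candidate_set(left, right, k=2))
```

In this setup, the size of the candidate set (and therefore the matcher's runtime) is controlled by k; the paper's evaluation instead fixes a pair-completeness target of 99.5% and compares how small each blocker can make the candidate set while reaching it.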
Authors: Alexander Brinkmann, Roee Shraga, Christian Bizer