
SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines (2303.03132v2)

Published 6 Mar 2023 in cs.DB and cs.LG

Abstract: The goal of entity resolution is to identify records in multiple datasets that represent the same real-world entity. However, comparing all records across datasets can be computationally intensive, leading to long runtimes. To reduce these runtimes, entity resolution pipelines consist of two parts: a blocker that applies a computationally cheap method to select candidate record pairs, and a matcher that then identifies matching pairs within this candidate set using more expensive methods. This paper presents SC-Block, a blocking method that uses supervised contrastive learning to position records in the embedding space and nearest neighbour search to build the candidate set. We benchmark SC-Block against eight state-of-the-art blocking methods. To relate the training time of SC-Block to the reduction in the overall runtime of the entity resolution pipeline, we combine SC-Block with four matching methods into complete pipelines. To measure the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher. The results show that SC-Block creates smaller candidate sets and that pipelines with SC-Block execute 1.5 to 2 times faster than pipelines with other blockers, without sacrificing F1 score. Blockers are often evaluated on relatively small datasets, which might cause runtime effects resulting from a large vocabulary size to be overlooked. To measure runtimes in a more challenging setting, we introduce a new benchmark dataset that requires large numbers of product offers to be blocked. On this large-scale benchmark dataset, pipelines that combine SC-Block with the best-performing matcher execute 8 times faster than pipelines that use another blocker with the same matcher, reducing the runtime from 2.5 hours to 18 minutes and clearly compensating for the 5 minutes required to train SC-Block.
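
The blocking step and the pair completeness measure described in the abstract can be illustrated with a minimal sketch. This is not the SC-Block implementation: the embeddings are hand-made toy vectors rather than the output of a model trained with supervised contrastive learning, the search is exhaustive rather than an approximate nearest neighbour index, and all function and variable names (`knn_block`, `pair_completeness`, `smallest_k`, `left`, `right`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_block(left, right, k):
    """Candidate pairs (i, j): each left record i paired with its k most similar right records."""
    candidates = set()
    for i, u in enumerate(left):
        ranked = sorted(range(len(right)),
                        key=lambda j: cosine(u, right[j]),
                        reverse=True)
        candidates.update((i, j) for j in ranked[:k])
    return candidates

def pair_completeness(candidates, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    if not true_matches:
        return 1.0
    return len(candidates & true_matches) / len(true_matches)

def smallest_k(left, right, true_matches, target):
    """Smallest k whose candidate set reaches the target pair completeness."""
    for k in range(1, len(right) + 1):
        if pair_completeness(knn_block(left, right, k), true_matches) >= target:
            return k
    return len(right)

# Toy embeddings (assumption: stand-ins for learned record embeddings).
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
right = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [-1.0, 0.0]]
true_matches = {(0, 0), (1, 1), (2, 2)}

k = smallest_k(left, right, true_matches, target=0.995)
candidates = knn_block(left, right, k)
print(k, len(candidates), len(left) * len(right))
```

In this toy setting the matcher would only need to score the pairs in `candidates` instead of all `len(left) * len(right)` pairs, which is the source of the runtime savings the abstract reports; the better the embedding separates matches from non-matches, the smaller the `k` needed to reach the target pair completeness.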

Authors (3)
  1. Alexander Brinkmann (5 papers)
  2. Roee Shraga (20 papers)
  3. Christian Bizer (15 papers)