
WDC Products: A Multi-Dimensional Entity Matching Benchmark (2301.09521v2)

Published 23 Jan 2023 in cs.LG

Abstract: The difficulty of an entity matching task depends on a combination of multiple factors such as the number of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions or they provide for the evaluation of matching methods along a single dimension, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are (i) amount of corner cases, (ii) generalization to unseen entities, and (iii) development set size (training set plus validation set). Generalization to unseen entities is a dimension not yet covered by any existing English-language benchmark, but it is crucial for evaluating the robustness of entity matching systems. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that, for entity matching, contrastive learning is more training-data-efficient than cross-encoders.
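
The two formulations and the unseen-entity dimension mentioned in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy offers, the product ids, and the helper names `pairwise_view`, `multiclass_view`, and `unseen_fraction` are hypothetical and are not taken from the WDC Products data or code.

```python
from itertools import combinations

# Hypothetical toy offers as (offer title, product id); not WDC Products data.
offers = [
    ("Acme X100 laptop 16GB", "p1"),
    ("ACME X-100 notebook, 16 GB RAM", "p1"),
    ("Acme X200 laptop 16GB", "p2"),
    ("Bolt wireless mouse black", "p3"),
]


def pairwise_view(offers):
    """Pair-wise formulation: label every offer pair as match (1) or non-match (0)."""
    return [
        (a_text, b_text, int(a_id == b_id))
        for (a_text, a_id), (b_text, b_id) in combinations(offers, 2)
    ]


def multiclass_view(offers):
    """Multi-class formulation: each distinct product id becomes one class label."""
    product_ids = sorted({pid for _, pid in offers})
    class_index = {pid: i for i, pid in enumerate(product_ids)}
    return [(text, class_index[pid]) for text, pid in offers]


def unseen_fraction(train_ids, test_ids):
    """Fraction of distinct test entities that never occur in the training set."""
    test = set(test_ids)
    return len(test - set(train_ids)) / len(test)


pairs = pairwise_view(offers)
labels = multiclass_view(offers)
```

In this toy setup the same four offers yield six labeled pairs under the pair-wise view and four labeled examples under the multi-class view. Note that "Acme X100" versus "Acme X200" is the kind of textually similar non-match that a corner-case dimension would aim to capture, although how the benchmark actually selects corner cases is not described in the abstract.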

Authors: Ralph Peeters, Reng Chiz Der, Christian Bizer