
Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching (2307.02726v1)

Published 6 Jul 2023 in cs.DB, cs.CY, and cs.LG

Abstract: Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on these two topics, little attention has been paid to the fairness of entity matching. Towards addressing this gap, we perform an extensive experimental evaluation of a variety of EM techniques in this paper. We generated two social datasets from publicly available datasets for the purpose of auditing EM through the lens of fairness. Our findings underscore potential unfairness under two common conditions in real-world societies: (i) when some demographic groups are overrepresented, and (ii) when names are more similar in some groups compared to others. Among our many findings, we note that while various fairness definitions are valuable for different settings, because of EM's inherent class imbalance, measures such as positive predictive value parity and true positive rate parity are, in general, more capable of revealing EM unfairness.
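
The two parity measures named in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `group_rates` and `parity_gap`, the toy labels, and the simplification that every candidate record pair is tagged with a single demographic group are all assumptions made for illustration. Positive predictive value (PPV) is TP / (TP + FP) and true positive rate (TPR) is TP / (TP + FN); parity holds when these rates are equal across groups.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group PPV and TPR for binary match (1) / non-match (0) labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts[g]  # touch the group so it appears even with no positives
        if p == 1 and t == 1:
            c["tp"] += 1
        elif p == 1 and t == 0:
            c["fp"] += 1
        elif p == 0 and t == 1:
            c["fn"] += 1
        # True negatives are ignored: neither PPV nor TPR depends on them.
    rates = {}
    for g, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        rates[g] = {
            "ppv": c["tp"] / predicted_pos if predicted_pos else None,
            "tpr": c["tp"] / actual_pos if actual_pos else None,
        }
    return rates


def parity_gap(rates, metric):
    """Largest difference in a metric across groups (0 means exact parity)."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values) if values else None


# Hypothetical toy data: five candidate pairs for each of two groups.
y_true = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

rates = group_rates(y_true, y_pred, groups)
print(rates)
print("PPV gap:", parity_gap(rates, "ppv"))
print("TPR gap:", parity_gap(rates, "tpr"))
```

Neither rate uses true negatives, which is one plausible reading of why such measures suit EM: non-matching pairs vastly outnumber matching ones, so measures dominated by true negatives can look similar across groups even when the matcher treats them differently.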

Authors (5)
  1. Nima Shahbazi (9 papers)
  2. Nikola Danevski (1 paper)
  3. Fatemeh Nargesian (12 papers)
  4. Abolfazl Asudeh (46 papers)
  5. Divesh Srivastava (37 papers)
Citations (10)
