
On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach (2401.03426v2)

Published 7 Jan 2024 in cs.CL and cs.AI

Abstract: Entity resolution, the task of identifying and merging records that refer to the same real-world entity, is crucial in sectors like e-commerce, healthcare, and law enforcement. LLMs introduce an innovative approach to this task, capitalizing on their advanced linguistic capabilities and a "pay-as-you-go" model that provides significant advantages to those without extensive data science expertise. However, current LLMs are costly because they are billed per API request. Existing methods often either lack quality or become prohibitively expensive at scale. To address these problems, we propose an uncertainty reduction framework that uses LLMs to improve entity resolution results. We first initialize the possible partitions of the records into entity clusters, where the records in each cluster refer to the same entity, and define the uncertainty of the result. Then, we reduce the uncertainty by selecting a few valuable matching questions for LLM verification. Upon receiving the answers, we update the probability distribution over the possible partitions. To further reduce costs, we design an efficient algorithm to judiciously select the most valuable matching pairs to query. Additionally, we create error-tolerant techniques to handle LLM mistakes and a dynamic adjustment method to reach truly correct partitions. Experimental results show that our method is efficient and effective, offering promising applications in real-world tasks.
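
The loop described in the abstract (keep a probability distribution over candidate partitions, ask the LLM about a record pair, update the distribution) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: uncertainty is measured as Shannon entropy, the LLM is modeled as answering correctly with a fixed probability `accuracy`, the next question is chosen greedily by lowest expected posterior entropy, and the function names, record ids, and probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a distribution over candidate partitions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def same_cluster(partition, a, b):
    """True if records a and b share a cluster in this partition."""
    return any(a in cluster and b in cluster for cluster in partition)

def likelihood(partition, pair, answer, accuracy):
    """Probability of the LLM giving `answer` if `partition` is the true one."""
    agrees = same_cluster(partition, *pair) == answer
    return accuracy if agrees else 1 - accuracy

def update(partitions, probs, pair, answer, accuracy=0.9):
    """Bayesian update of the partition probabilities after one LLM answer.

    `accuracy` is an assumed chance that the LLM answers correctly, so a
    wrong answer lowers a partition's weight without eliminating it.
    """
    weighted = [p * likelihood(part, pair, answer, accuracy)
                for part, p in zip(partitions, probs)]
    total = sum(weighted)
    return [w / total for w in weighted]

def expected_entropy(partitions, probs, pair, accuracy=0.9):
    """Expected entropy of the posterior if we ask the LLM about `pair`."""
    result = 0.0
    for answer in (True, False):
        # Probability of observing this answer under the current belief.
        p_answer = sum(p * likelihood(part, pair, answer, accuracy)
                       for part, p in zip(partitions, probs))
        if p_answer > 0:
            posterior = update(partitions, probs, pair, answer, accuracy)
            result += p_answer * entropy(posterior)
    return result

def best_question(partitions, probs, pairs, accuracy=0.9):
    """Greedy choice: the pair whose answer is expected to cut entropy most."""
    return min(pairs,
               key=lambda pair: expected_entropy(partitions, probs, pair, accuracy))

# Hypothetical example: three records and three candidate partitions.
partitions = [
    [{"r1", "r2"}, {"r3"}],
    [{"r1"}, {"r2", "r3"}],
    [{"r1"}, {"r2"}, {"r3"}],
]
probs = [0.5, 0.3, 0.2]
pairs = [("r1", "r2"), ("r2", "r3"), ("r1", "r3")]

question = best_question(partitions, probs, pairs)
posterior = update(partitions, probs, ("r1", "r2"), answer=True)
```

In this toy setup, asking about ("r1", "r3") is uninformative because every candidate partition already separates those two records, so the greedy rule never prefers it. Enumerating all pairs and partitions as done here is only practical for tiny inputs.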

Authors (6)
  1. Huahang Li
  2. Longyu Feng
  3. Shuangyin Li
  4. Fei Hao
  5. Chen Jason Zhang
  6. Yuanfeng Song
