BoostER: Leveraging Large Language Models for Enhancing Entity Resolution (2403.06434v1)
Abstract: Entity resolution, which involves identifying and merging records that refer to the same real-world entity, is a crucial task in areas like Web data integration. Its importance is underscored by the many duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. LLMs such as GPT-4 have demonstrated advanced linguistic capabilities that could offer a new paradigm for this task. In this paper, we propose a demonstration system named BoostER that examines the possibility of leveraging LLMs in the entity resolution process, revealing advantages in both ease of deployment and low cost. Our approach optimally selects a set of matching questions, poses them to LLMs for verification, and then refines the distribution of entity resolution results using the LLMs' responses. This offers a promising path to high-quality entity resolution in real-world applications, especially for individuals or small companies, without the need for extensive model training or significant financial investment.
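The select, verify, and refine loop described in the abstract can be illustrated with a minimal sketch. It is not the BoostER algorithm: the abstract does not specify how questions are selected or how the distribution is refined, so this sketch swaps in a greedy entropy-based selection and a simple Bayesian update. The names `select_questions`, `bayes_update`, `refine`, and `ask_llm`, as well as the `llm_accuracy` parameter, are hypothetical, and the LLM call is replaced by a stub.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_questions(match_probs, budget):
    """Greedily pick the `budget` record pairs with the most uncertain match probability.

    Assumption: entropy ranking stands in for the paper's (unspecified here)
    optimal question selection.
    """
    ranked = sorted(match_probs, key=lambda pair: entropy(match_probs[pair]), reverse=True)
    return ranked[:budget]


def bayes_update(prior, llm_says_match, llm_accuracy=0.9):
    """Posterior match probability after one LLM answer.

    Assumption: the LLM answers correctly with probability `llm_accuracy`,
    independent of the pair being asked about.
    """
    if llm_says_match:
        num = llm_accuracy * prior
        den = num + (1 - llm_accuracy) * (1 - prior)
    else:
        num = (1 - llm_accuracy) * prior
        den = num + llm_accuracy * (1 - prior)
    return num / den


def refine(match_probs, ask_llm, budget, llm_accuracy=0.9):
    """Ask the LLM about the selected pairs and return updated match probabilities."""
    updated = dict(match_probs)
    for pair in select_questions(match_probs, budget):
        updated[pair] = bayes_update(updated[pair], ask_llm(pair), llm_accuracy)
    return updated


# Hypothetical prior match probabilities from some upstream ER tool.
priors = {("r1", "r2"): 0.5, ("r1", "r3"): 0.95, ("r2", "r3"): 0.4}


def ask_llm(pair):
    """Stub standing in for a real LLM call; says only (r1, r2) is a match."""
    return pair == ("r1", "r2")


result = refine(priors, ask_llm, budget=2)
```

With a budget of two questions, only the two most uncertain pairs are sent to the stubbed LLM; the confident pair `("r1", "r3")` keeps its prior, which is the kind of cost saving the abstract points to.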
- Huahang Li
- Shuangyin Li
- Fei Hao
- Chen Jason Zhang
- Yuanfeng Song
- Lei Chen