Cost-Efficient Prompt Engineering for Unsupervised Entity Resolution (2310.06174v2)
Abstract: Entity Resolution (ER) is the problem of semi-automatically determining when two entities refer to the same underlying entity, with applications ranging from healthcare to e-commerce. Traditional ER solutions required considerable manual expertise, including domain-specific feature engineering, as well as identification and curation of training data. Recently released LLMs provide an opportunity to make ER more seamless and domain-independent. However, it is also well known that LLMs can pose risks, and that the quality of their outputs can depend on how prompts are engineered. Unfortunately, a systematic experimental study on the effects of different prompting methods for addressing unsupervised ER, using LLMs like ChatGPT, has been lacking thus far. This paper aims to address this gap by conducting such a study. We consider some relatively simple and cost-efficient ER prompt engineering methods and apply them to ER on two real-world datasets widely used in the community. We use an extensive set of experimental results to show that an LLM like GPT3.5 is viable for high-performing unsupervised ER, and interestingly, that more complicated and detailed (and hence, expensive) prompting methods do not necessarily outperform simpler approaches. We provide brief discussions on qualitative and error analysis, including a study of the inter-consistency of different prompting methods to determine whether they yield stable outputs. Finally, we consider some limitations of LLMs when applied to ER.
Authors: Khanin Sisaengsuwanchai, Navapat Nananukul, Mayank Kejriwal