Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration (2312.03987v1)
Abstract: Entity resolution (ER) is an important data integration task with a wide spectrum of applications. The state-of-the-art solutions for ER rely on pre-trained language models (PLMs), which require fine-tuning on a large number of labeled matching/non-matching entity pairs. Recently, large language models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters; this is known as in-context learning (ICL) and facilitates effective learning from a few labeled demonstrations provided in the input context. However, existing ICL approaches to ER typically supply a task description and a set of demonstrations for each entity pair and thus incur a high monetary cost when interfacing with LLMs. To address this problem, in this paper, we provide a comprehensive study of how to develop a cost-effective batch prompting approach to ER. We introduce a framework, BATCHER, consisting of demonstration selection and question batching, and explore different design choices that support batch prompting for ER. We also devise a covering-based demonstration selection strategy that achieves an effective balance between matching accuracy and monetary cost. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Through extensive experiments, we find that batch prompting is very cost-effective for ER, compared with not only PLM-based methods fine-tuned with extensive labeled data but also LLM-based methods with manually designed prompting. We also provide guidance for selecting appropriate design choices for batch prompting.
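To make the batch-prompting idea concrete, below is a minimal sketch of how a batch prompt for ER could be assembled from a shared task description, a set of selected demonstrations, and several entity-pair questions, plus a greedy covering-style demonstration selector. All names (`serialize_pair`, `select_covering_demos`, `build_batch_prompt`), the token-overlap coverage criterion, and the prompt wording are illustrative assumptions, not the paper's actual BATCHER implementation, which explores a broader design space of selection and batching strategies.

```python
"""Illustrative sketch of batch prompting for entity resolution (ER).

Assumptions (not from the paper): Jaccard token overlap as the coverage
measure, a fixed coverage threshold, and the specific prompt wording.
"""


def serialize_pair(e1: dict, e2: dict) -> str:
    """Flatten two entity records into a single textual question."""
    fmt = lambda e: "; ".join(f"{k}: {v}" for k, v in e.items())
    return f"Entity A: {fmt(e1)} || Entity B: {fmt(e2)}"


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, used here as a stand-in coverage measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)


def select_covering_demos(questions, demo_pool, threshold=0.3):
    """Greedy covering-based selection: repeatedly pick the labeled demonstration
    that covers (is similar enough to) the most still-uncovered questions."""
    uncovered = set(range(len(questions)))
    selected = []
    while uncovered:
        best, best_cover = None, set()
        for demo in demo_pool:
            cover = {i for i in uncovered
                     if jaccard(demo["question"], questions[i]) >= threshold}
            if len(cover) > len(best_cover):
                best, best_cover = demo, cover
        if best is None:  # no remaining demonstration covers any uncovered question
            break
        selected.append(best)
        uncovered -= best_cover
    return selected


def build_batch_prompt(questions, demos):
    """Assemble one prompt that amortizes the task description and the
    demonstrations over a whole batch of entity-pair questions."""
    lines = ["Decide for each pair whether the two entities refer to the same "
             "real-world object. Answer with one line per question: 'Qi: yes/no'.",
             ""]
    for d in demos:
        lines.append(f"Example: {d['question']}")
        lines.append(f"Answer: {'yes' if d['label'] else 'no'}")
    lines.append("")
    for i, q in enumerate(questions, 1):
        lines.append(f"Q{i}: {q}")
    return "\n".join(lines)


if __name__ == "__main__":
    pool = [
        {"question": serialize_pair({"title": "iPhone 12 64GB black"},
                                    {"title": "Apple iPhone 12, 64 GB, black"}),
         "label": True},
        {"question": serialize_pair({"title": "Galaxy S21 case"},
                                    {"title": "Samsung Galaxy S21 smartphone"}),
         "label": False},
    ]
    batch = [
        serialize_pair({"title": "iPhone 12 128GB"},
                       {"title": "Apple iPhone 12 128 GB"}),
        serialize_pair({"title": "USB-C cable 1m"},
                       {"title": "Galaxy S21 Ultra"}),
    ]
    demos = select_covering_demos(batch, pool)
    # This single string would be sent as one LLM request, so the cost of the
    # instructions and demonstrations is shared across all questions in the batch.
    print(build_batch_prompt(batch, demos))
```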
Authors: Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, Xiaoyong Du