Abstract

Collecting labeled datasets in finance is challenging due to the scarcity of domain experts and the high cost of employing them. While LLMs have demonstrated remarkable performance on data annotation tasks for general-domain datasets, their effectiveness on domain-specific datasets remains underexplored. To address this gap, we investigate the potential of LLMs as efficient data annotators for extracting relations from financial documents. We compare the annotations produced by three LLMs (GPT-4, PaLM 2, and MPT Instruct) against those of expert annotators and crowdworkers. We demonstrate that current state-of-the-art LLMs can be a sufficient alternative to non-expert crowdworkers. We analyze the models under various prompts and parameter settings and find that customizing the prompt for each relation group by providing specific examples belonging to that group is paramount. Furthermore, we introduce a reliability index (LLM-RelIndex) that can be used to identify outputs that may require expert attention. Finally, we perform an extensive time, cost, and error analysis and provide recommendations for the collection and use of automated annotations in domain-specific settings.
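The abstract's finding that relation-group-specific examples are paramount corresponds to group-aware few-shot prompting. The following is a minimal sketch of how such a prompt could be assembled; the group names, example sentences, and relation labels are illustrative assumptions, not the authors' actual prompt templates.

```python
# Minimal sketch of relation-group-specific few-shot prompting, assuming a setup
# like the one described in the abstract. Group names, example sentences, and
# relation labels below are hypothetical placeholders, not the authors' prompts.

# In-context examples keyed by relation group; each example is
# (sentence, entity_1, entity_2, relation_label).
FEW_SHOT_EXAMPLES = {
    "person-org": [
        ("Jane Doe was appointed chief financial officer of Acme Corp in 2021.",
         "Jane Doe", "Acme Corp", "employee_of"),
    ],
    "org-org": [
        ("Acme Corp completed its acquisition of Beta Ltd last quarter.",
         "Acme Corp", "Beta Ltd", "acquired_by"),
    ],
}


def build_prompt(sentence, entity_1, entity_2, group, labels):
    """Compose an annotation prompt whose in-context examples all come from `group`."""
    lines = [
        "You are annotating entity relations in financial documents.",
        f"Choose exactly one label from: {', '.join(labels)}.",
        "",
    ]
    # Only examples from the same relation group as the target instance are included.
    for ex_sentence, e1, e2, label in FEW_SHOT_EXAMPLES[group]:
        lines += [f"Sentence: {ex_sentence}",
                  f"Entities: {e1} ; {e2}",
                  f"Relation: {label}",
                  ""]
    lines += [f"Sentence: {sentence}",
              f"Entities: {entity_1} ; {entity_2}",
              "Relation:"]
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        sentence="Gamma Inc agreed to acquire Delta LLC for $2 billion.",
        entity_1="Gamma Inc",
        entity_2="Delta LLC",
        group="org-org",
        labels=["acquired_by", "subsidiary_of", "no_relation"],
    )
    print(prompt)  # This string would then be sent to a model such as GPT-4, PaLM 2, or MPT Instruct.
```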
