
Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks (2305.05862v2)

Published 10 May 2023 in cs.CL and cs.AI

Abstract: The most recent large language models (LLMs) such as ChatGPT and GPT-4 have shown exceptional capabilities of generalist models, achieving state-of-the-art performance on a wide range of NLP tasks with little or no adaptation. How effective are such models in the financial domain? Understanding this basic question would have a significant impact on many downstream financial analytical tasks. In this paper, we conduct an empirical study and provide experimental evidence of their performance on a wide variety of financial text analytical problems, using eight benchmark datasets from five categories of tasks. We report both the strengths and limitations of the current models by comparing them to the state-of-the-art fine-tuned approaches and the recently released domain-specific pretrained models. We hope our study can help understand the capability of the existing models in the financial domain and facilitate further improvements.


This paper critically examines the performance of state-of-the-art LLMs such as ChatGPT and GPT-4 in the specialized domain of financial text analytics. The authors employ eight benchmark datasets spanning five categories of tasks: sentiment analysis, classification, named entity recognition (NER), relation extraction (RE), and question answering (QA). These categories cover a broad spectrum of financial text analytics typically encountered in industry settings. The paper provides a comparison of these generalist models with both fine-tuned models and domain-specific pretrained models.

Methodology and Experimental Setup

The researchers meticulously selected datasets representing various aspects of financial text analytics. The datasets include Financial PhraseBank, FiQA Sentiment Analysis, TweetFinSent, headlines classification, FIN3 NER, REFinD, FinQA, and ConvFinQA. This selection represents tasks ranging from relatively straightforward sentiment analysis and classification to more complex tasks such as NER, RE, and QA.

The evaluation uses accuracy, macro-averaged F1, and entity-level F1 as metrics, providing a comprehensive view of performance across the different task categories.
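For concreteness, the two F1 variants mentioned above can be sketched in pure Python. This is an illustrative implementation of the standard definitions, not the paper's evaluation code: macro-F1 averages per-class F1 with equal weight, while entity-level F1 counts a prediction as correct only if the (span, type) pair exactly matches a gold entity.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def entity_f1(gold, pred):
    """Entity-level F1: a predicted (span, type) pair must match gold exactly."""
    gold_c, pred_c = Counter(gold), Counter(pred)
    tp = sum((gold_c & pred_c).values())
    prec = tp / sum(pred_c.values()) if pred_c else 0.0
    rec = tp / sum(gold_c.values()) if gold_c else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Example: two of three sentiment labels correct, one of two entities found.
m = macro_f1(["pos", "neg", "pos"], ["pos", "neg", "neg"])
e = entity_f1([("Acme Corp", "ORG"), ("2023", "DATE")], [("Acme Corp", "ORG")])
```

Note that macro-F1 penalizes poor performance on rare classes as heavily as on frequent ones, which matters for imbalanced financial sentiment datasets.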

Key Findings

  1. Performance on Sentiment Analysis and Classification: ChatGPT and GPT-4 show strong performance on sentiment analysis and classification tasks. Specifically, GPT-4 outperforms domain-specific models like FinBert on Financial PhraseBank and achieves competitive scores on FiQA and TweetFinSent datasets, illustrating its robust generalization capabilities in financial contexts. For example, GPT-4 attained a weighted F1 score of 88.11 on the FiQA dataset, indicating its efficacy in analyzing financial sentiment in detailed contexts.
  2. Named Entity Recognition: The models exhibit limitations in structured prediction tasks like NER, with GPT-4 achieving an entity-level F1 score of 56.71 on the FIN3 dataset. This is inferior to domain-specific models and CRF models fine-tuned on similar data. The observed performance gap underscores the challenges LLMs face in domain-specific structured prediction tasks.
  3. Relation Extraction: On the REFinD dataset, GPT-4 consistently outperforms ChatGPT but falls short of the performance delivered by fine-tuned models like Luke-base. This suggests that while GPT-4 has an enhanced understanding of entity relations, domain-specific fine-tuning still holds an advantage in extracting complex relations from financial texts.
  4. Question Answering: GPT-4 demonstrates superior performance in numerical reasoning and achieves accuracy rates significantly higher than those of fine-tuned models like FinQANet on tasks such as FinQA and ConvFinQA. For instance, GPT-4 achieved an accuracy of 78.03 with Chain-of-Thought (CoT) prompting on FinQA, surpassing traditional fine-tuned models, indicating the model's enhanced capability in multi-step reasoning and complex numerical operations.
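The Chain-of-Thought prompting behind the QA results above can be sketched as a simple prompt template. The wording and helper below are illustrative assumptions, not the paper's actual template: the key idea is to ask the model for intermediate reasoning before the final numeric answer.

```python
def build_cot_prompt(table: str, question: str) -> str:
    """Assemble a zero-shot chain-of-thought prompt for a FinQA-style
    numerical question. The instruction wording is illustrative only."""
    return (
        "You are a financial analyst. Use the table to answer the question.\n\n"
        f"Table:\n{table}\n\n"
        f"Question: {question}\n"
        "Let's think step by step, then give the final numeric answer "
        "on a line starting with 'Answer:'."
    )

# Hypothetical single-row table; real FinQA tables are larger.
prompt = build_cot_prompt(
    "Revenue 2021: $120M | Revenue 2022: $150M",
    "What was the year-over-year revenue growth rate?",
)
```

Eliciting the intermediate steps is what lets the model decompose multi-step arithmetic (here, (150 - 120) / 120 = 25%) instead of guessing the answer in one shot.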

Implications

The findings have several practical and theoretical implications. From a practical standpoint, GPT-4 proves to be a viable candidate for a broad array of financial NLP tasks, potentially reducing the need for domain-specific model fine-tuning. For less complex tasks, generalist models like GPT-4 offer robust baseline performance without the overhead of dataset-specific adjustments.

However, for more complex tasks such as named entity recognition and relation extraction, specialized models still provide superior performance, highlighting the need for ongoing development of domain-specific adaptations to LLM architectures. This underscores an area of future research focused on specialized pretraining and fine-tuning strategies tailored to the financial domain.

From a theoretical perspective, the results emphasize a significant potential for generalist models in handling specialized domains, provided that effective strategies like few-shot learning and CoT prompting are employed. These prompting techniques lift the performance of models significantly, as seen across various datasets.
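Few-shot prompting, the other technique noted above, amounts to prepending labeled demonstrations to the query. The sketch below assumes three hypothetical demonstration sentences for financial sentiment classification; neither the examples nor the template come from the paper.

```python
# Hypothetical in-context demonstrations (not from the paper's datasets).
FEW_SHOT = [
    ("Shares surged after the company raised its full-year guidance.", "positive"),
    ("The firm warned of weaker margins amid rising input costs.", "negative"),
    ("The board meeting is scheduled for next Tuesday.", "neutral"),
]

def few_shot_prompt(sentence: str) -> str:
    """Prepend labeled demonstrations, then leave the final label blank
    for the model to complete."""
    demos = "\n\n".join(
        f"Sentence: {s}\nSentiment: {label}" for s, label in FEW_SHOT
    )
    return (
        "Classify the sentiment of each financial sentence as "
        "positive, negative, or neutral.\n\n"
        f"{demos}\n\nSentence: {sentence}\nSentiment:"
    )
```

Ending the prompt at "Sentiment:" constrains the model's continuation to the label position, which makes the output easier to parse than free-form answers.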

Future Directions

Future research could explore the intersection of generalist models and domain-specific data, including advanced prompting techniques, hybrid models combining generalist and specialized components, and improved pretraining strategies. Additionally, expanding the evaluation to include more diverse financial NLP tasks, such as symbolic reasoning and other complex logical deductions, would provide a more comprehensive understanding of the capabilities and limitations of current LLMs in the financial domain.

In summary, while GPT-4 and ChatGPT are powerful tools for general financial text analytics, more specialized adaptations are still necessary for optimal performance in complex and domain-specific tasks. The paper highlights the promising role of generalist models while encouraging the continued refinement and integration of domain-specific enhancements.

References (30)
  1. Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models.
  2. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity.
  3. FinQA: A dataset of numerical reasoning over financial data.
  4. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering.
  5. Deep reinforcement learning from human preferences.
  6. Training compute-optimal large language models.
  7. REFinD: Relation extraction financial dataset. arXiv preprint arXiv:2305.18322.
  8. How secure is code generated by ChatGPT?
  9. Solving quantitative reasoning problems with language models.
  10. Learning better intent representations for financial open intent classification.
  11. Evaluating the logical reasoning ability of ChatGPT and GPT-4.
  12. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6).
  13. WWW'18 open challenge: Financial opinion mining and question answering. In Companion Proceedings of The Web Conference 2018, WWW '18, pages 1941–1942, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
  14. Good debt or bad debt: Detecting semantic orientations in economic texts.
  15. Evaluation of sentiment analysis in finance: From lexicons to transformers. IEEE Access, 8:131662–131682.
  16. ChatGPT versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots.
  17. The measurement of meaning. University of Illinois Press.
  18. TweetFinSent: A dataset of stock sentiments on Twitter. In Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP), pages 37–47, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  19. Is ChatGPT a general-purpose natural language processing task solver?
  20. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pages 84–90, Parramatta, Australia.
  21. What language model to train if you have one million GPU hours?
  22. When FLUE meets FLANG: Benchmarks and large pre-trained language model for financial domain. arXiv preprint arXiv:2211.00083.
  23. Large language models encode clinical knowledge.
  24. Ankur Sinha and Tanmay Khandait. 2020. Impact of news on the commodity market: Dataset and results.
  25. Galactica: A large language model for science.
  26. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  27. BloombergGPT: A large language model for finance.
  28. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.
  29. Sharon Yang. 2021. Financial use cases for named entity recognition (NER).
  30. Global Table Extractor (GTE): A framework for joint table identification and cell structure recognition using visual context.
Authors (7)
  1. Xianzhi Li
  2. Samuel Chan
  3. Xiaodan Zhu
  4. Yulong Pei
  5. Zhiqiang Ma
  6. Xiaomo Liu
  7. Sameena Shah
Citations (48)