Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams (2310.08678v1)

Published 12 Oct 2023 in cs.CL, cs.AI, and q-fin.GN

Abstract: LLMs have demonstrated remarkable performance on a wide range of NLP tasks, often matching or even beating state-of-the-art task-specific models. This study aims at assessing the financial reasoning capabilities of LLMs. We leverage mock exam questions of the Chartered Financial Analyst (CFA) Program to conduct a comprehensive evaluation of ChatGPT and GPT-4 in financial analysis, considering Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) scenarios. We present an in-depth analysis of the models' performance and limitations, and estimate whether they would have a chance at passing the CFA exams. Finally, we outline insights into potential strategies and improvements to enhance the applicability of LLMs in finance. In this perspective, we hope this work paves the way for future studies to continue enhancing LLMs for financial reasoning through rigorous evaluation.

Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams

The paper presents a comprehensive evaluation of two prominent LLMs, ChatGPT and GPT-4, on their ability to perform financial reasoning through the lens of mock Chartered Financial Analyst (CFA) exams. It examines whether these models can effectively solve questions from CFA Levels I and II, exams known for their rigorous testing of finance-related skills.

The experiment examined the models under Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) prompting scenarios. The results showed that GPT-4 outperformed ChatGPT across most categories, especially in topics like Derivatives and Alternative Investments. However, both models struggled with topics that demand extensive financial domain knowledge and nuanced problem solving, such as Financial Reporting and Portfolio Management.
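To make the three prompting setups concrete, the following minimal Python sketch assembles ZS, CoT, and FS prompts for a multiple-choice, CFA-style question; the question text, the few-shot exemplar, and the instruction wording are illustrative and are not the paper's actual templates.

```python
# Minimal sketch of Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS)
# prompt construction for a multiple-choice, CFA-style question.
# Question, exemplar, and instruction wording are illustrative only.

QUESTION = (
    "A bond with a 6% annual coupon trades at par. If market yields fall, "
    "the bond's price will most likely:\nA. decrease\nB. stay the same\nC. increase"
)

# One worked exemplar used only in the Few-Shot setting (hypothetical).
FS_EXEMPLAR = (
    "Q: If a company's inventory turnover rises while sales are flat, average "
    "inventory has most likely:\nA. increased\nB. decreased\nC. stayed the same\n"
    "Answer: B"
)

def build_prompt(question: str, mode: str) -> str:
    """Return a prompt string for the given evaluation mode."""
    if mode == "zs":    # Zero-Shot: question only, answer directly.
        return f"{question}\nAnswer with A, B, or C."
    if mode == "cot":   # Chain-of-Thought: ask for step-by-step reasoning first.
        return f"{question}\nLet's think step by step, then give the final answer (A, B, or C)."
    if mode == "fs":    # Few-Shot: prepend one or more solved exemplars.
        return f"{FS_EXEMPLAR}\n\nQ: {question}\nAnswer:"
    raise ValueError(f"unknown mode: {mode}")

for mode in ("zs", "cot", "fs"):
    print(f"--- {mode.upper()} ---\n{build_prompt(QUESTION, mode)}\n")
```

In the paper's setup, exemplars and instructions would be tailored per exam level and topic; the sketch only shows how the three conditions differ structurally.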

GPT-4's consistently higher test accuracy relative to ChatGPT reflects its stronger reasoning over complex problem spaces. For instance, GPT-4 reached a peak Level I accuracy of 74.6% in the FS setting, compared to ChatGPT's best score of 63.0% under similar conditions. In Level II, GPT-4 also performed well, with its best configuration scoring high enough to suggest it could potentially pass the CFA Level II exam.
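For context on how per-topic accuracy figures of this kind are typically computed, the sketch below scores raw model outputs against gold answer keys; the extract_choice parsing rule and the sample triples are hypothetical stand-ins rather than the paper's evaluation protocol.

```python
import re
from collections import defaultdict

# Illustrative (topic, gold answer, raw model output) triples; not the paper's data.
results = [
    ("Fixed Income", "C", "The correct answer is C because lower yields raise prices."),
    ("Ethics", "A", "Answer: B"),
    ("Portfolio Management", "B", "I would choose (B)."),
]

def extract_choice(text: str):
    """Pull the first standalone A/B/C letter out of a model response."""
    match = re.search(r"\b([ABC])\b", text)
    return match.group(1) if match else None

correct, total = defaultdict(int), defaultdict(int)
for topic, gold, output in results:
    total[topic] += 1
    if extract_choice(output) == gold:
        correct[topic] += 1

for topic in total:
    print(f"{topic}: {correct[topic] / total[topic]:.0%} accuracy")
```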

A notable finding is the limited efficacy of CoT prompting for ChatGPT, with minor improvements noted when compared to ZS prompting on Level I exams. Although CoT seemingly enhances the ability of models to parse complex information by encouraging step-by-step reasoning, it simultaneously exposes gaps in domain-specific knowledge and computational accuracy, leading to a range of errors. GPT-4 showed a more pronounced benefit from CoT in some contexts, especially when tackling the elaborate Level II questions, although this did not universally surpass FS prompting.

The implications of this paper extend to the potential use of LLMs in financial domains. While the models demonstrate a level of competency in financial reasoning under specific conditions, they still show significant limitations when faced with intricate domain-specific problems. This insight calls for a layered approach to enhance LLMs' capabilities, suggesting a potential integration of specialized knowledge bases and advanced computation tools to mitigate their current deficiencies.

Future directions involve strengthening the financial reasoning capabilities of LLMs by incorporating retrieval-augmented generation, domain-specific pre-training, and external calculation modules. Combining FS and CoT prompting could also yield improved problem-solving accuracy. As such, the research opens a path for further work on optimizing LLMs for specialized applications within the financial sector, potentially reshaping the landscape of automated financial analysis tools.
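As a rough picture of this retrieval-plus-tools direction, the sketch below pairs a toy keyword retrieval step with an external calculator call; the corpus, the retrieve helper, and the CALC convention are all hypothetical and are not components described in the paper.

```python
import math

# Hypothetical mini-corpus of curriculum passages; a real system would use a
# vector store over curriculum material or filings rather than this toy dict.
CORPUS = {
    "duration": "Modified duration approximates the percent price change of a bond per 1% change in yield.",
    "sharpe": "The Sharpe ratio is (portfolio return - risk-free rate) divided by the standard deviation of returns.",
}

def retrieve(question: str):
    """Toy keyword retrieval standing in for a retrieval-augmented generation step."""
    return [text for key, text in CORPUS.items() if key in question.lower()]

def calculator(expression: str) -> float:
    """External calculation module: delegate arithmetic instead of trusting the LLM."""
    return eval(expression, {"__builtins__": {}}, {"sqrt": math.sqrt})  # restricted eval, sketch only

def build_augmented_prompt(question: str) -> str:
    context = "\n".join(retrieve(question)) or "(no passages retrieved)"
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "If arithmetic is needed, emit CALC(<expression>) and it will be computed externally."
    )

question = "What is the Sharpe ratio of a portfolio returning 8% with a 2% risk-free rate and 10% volatility?"
print(build_augmented_prompt(question))
print("CALC result:", calculator("(0.08 - 0.02) / 0.10"))
```

A production system would replace the keyword lookup with proper dense retrieval and would route any model-emitted CALC expressions through the calculator before the final answer is produced.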

Authors (9)
  1. Ethan Callanan (3 papers)
  2. Amarachi Mbakwe (1 paper)
  3. Antony Papadimitriou (3 papers)
  4. Yulong Pei (31 papers)
  5. Mathieu Sibue (5 papers)
  6. Xiaodan Zhu (94 papers)
  7. Zhiqiang Ma (19 papers)
  8. Xiaomo Liu (17 papers)
  9. Sameena Shah (33 papers)
Citations (14)