Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams
The paper presents a comprehensive evaluation of two prominent large language models (LLMs), ChatGPT and GPT-4, on financial reasoning, using mock Chartered Financial Analyst (CFA) exams as the benchmark. It scrutinizes whether these LLMs can solve questions from CFA Levels I and II, exams known for their rigorous testing of finance-related skills.
The experiment examined the models under Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) prompting scenarios. The results showed that GPT-4 outperformed ChatGPT across most categories, especially in topics such as Derivatives and Alternative Investments. However, both models struggled with topics that demand extensive financial domain knowledge and nuanced problem solving, such as Financial Reporting and Portfolio Management.
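The paper does not reproduce its exact prompt templates, but the three settings are standard. The following is a minimal sketch of how they might be instantiated, assuming the OpenAI Python SDK; the model string, exemplar, and prompt wording are illustrative, not the paper's actual prompts.

```python
# Minimal sketch of the three prompting settings (ZS, CoT, FS).
# Assumptions: OpenAI Python SDK >= 1.0; all prompt wording and the
# few-shot exemplar below are illustrative, not the paper's templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = (
    "Q: A bond's yield to maturity falls. What happens to its price?\n"
    "Choices: A. Falls  B. Rises  C. Unchanged\n"
    "Answer: B\n\n"
)  # hypothetical solved exemplar prepended in the FS setting

def build_prompt(question: str, choices: str, setting: str) -> str:
    if setting == "ZS":   # zero-shot: the question alone
        return f"Q: {question}\nChoices: {choices}\nAnswer:"
    if setting == "CoT":  # chain-of-thought: elicit step-by-step reasoning
        return (f"Q: {question}\nChoices: {choices}\n"
                "Let's think step by step before giving the final answer.")
    if setting == "FS":   # few-shot: solved exemplars before the question
        return FEW_SHOT_EXAMPLES + f"Q: {question}\nChoices: {choices}\nAnswer:"
    raise ValueError(f"unknown setting: {setting}")

def ask(question: str, choices: str, setting: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(question, choices, setting)}],
        temperature=0,  # deterministic decoding for evaluation
    )
    return response.choices[0].message.content
```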
GPT-4's consistently higher test accuracy reflects its stronger reasoning over complex problem spaces. For instance, GPT-4 reached a peak accuracy of 74.6% on Level I in the FS setting, compared to ChatGPT's best score of 63.0% under the same setting. On Level II, GPT-4 also performed commendably, achieving a top score that, depending on the prompting configuration, would potentially be enough to pass the CFA Level II exam.
A notable finding is the limited efficacy of CoT prompting for ChatGPT, which showed only minor improvements over ZS prompting on the Level I exams. Although CoT encourages step-by-step reasoning and so seemingly helps the models parse complex information, it simultaneously exposes gaps in domain-specific knowledge and computational accuracy, leading to a range of errors. GPT-4 benefited more visibly from CoT in some contexts, especially on the elaborate Level II questions, although CoT did not universally surpass FS prompting.
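Scoring CoT outputs also requires extracting the final choice from free-form reasoning. The paper's exact extraction rule is not reproduced here; the sketch below shows one plausible heuristic (the regex rule and the "take the last letter" convention are assumptions).

```python
import re

# Hypothetical answer-extraction heuristic for three-option CFA questions.
# Assumption: CoT completions typically end with a statement like
# "... so the answer is B.", so the last standalone capital A/B/C is taken
# as the model's final choice. The paper's actual logic may differ.
ANSWER_RE = re.compile(r"\b([ABC])\b")

def extract_choice(completion: str) -> str | None:
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None

def accuracy(completions: list[str], gold: list[str]) -> float:
    correct = sum(extract_choice(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)

# e.g. accuracy(["...step by step... the answer is B.", "Answer: A"],
#               ["B", "C"]) -> 0.5
```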
The implications of this paper extend to the potential use of LLMs in financial domains. While the models demonstrate a degree of competency in financial reasoning under specific conditions, they still show significant limitations on intricate domain-specific problems. This finding argues for a layered approach to enhancing LLM capabilities, such as integrating specialized knowledge bases and external computation tools to mitigate the current deficiencies.
Future prospects involve strengthening the financial reasoning capabilities of LLMs by incorporating retrieval-augmented generation, domain-specific pre-training, and external calculation modules. Combining FS and CoT prompting could also improve problem-solving accuracy. The research thus opens a path for further work on optimizing LLMs for specialized applications in the financial sector, potentially reshaping the landscape of automated financial analysis tools.
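None of these extensions are implemented in the paper. As a hedged illustration of the direction it suggests, a retrieval-augmented prompt with arithmetic offloaded to an external evaluator might look like the sketch below; the toy lexical retriever, the corpus, and the tool protocol are all assumptions, not the authors' system.

```python
# Hypothetical sketch of the suggested extensions: retrieval-augmented
# prompting plus an external calculation step. Nothing here is from the
# paper; retriever, corpus, and tool interface are illustrative.
import ast
import operator

def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy lexical retriever: rank passages by term overlap with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_terms & set(p.lower().split())))
    return scored[:k]

# Safe arithmetic evaluator: offloads math to Python instead of trusting
# the LLM's own calculations, addressing the computational-accuracy errors
# noted above. Only basic operators are whitelisted.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_calculate(expression: str) -> float:
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval"))

def augmented_prompt(question: str, corpus: list[str]) -> str:
    """Prepend retrieved passages and ask the model to emit raw arithmetic."""
    context = "\n".join(retrieve(question, corpus))
    return (f"Context:\n{context}\n\nQ: {question}\n"
            "Use the context; write any arithmetic as a plain expression "
            "so it can be evaluated externally.")

# e.g. safe_calculate("100 * (1 + 0.05) ** 2") -> 110.25
```

The design intent is separation of concerns: retrieval supplies the domain knowledge the models were observed to lack, while the evaluator handles the arithmetic they were observed to get wrong.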