Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination (2311.15548v1)

Published 27 Nov 2023 in cs.CL, cs.AI, cs.LG, and q-fin.ST

Abstract: The hallucination issue is recognized as a fundamental deficiency of LLMs, especially when they are applied to fields such as finance, education, and law. Despite growing concern, empirical investigation has been lacking. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. First, we investigate LLMs' ability to explain financial concepts and terminology. Second, we assess LLMs' capacity to query historical stock prices. Third, to alleviate hallucination, we evaluate the efficacy of four practical methods: few-shot learning, Decoding by Contrasting Layers (DoLa), Retrieval-Augmented Generation (RAG), and prompt-based tool learning, in which the model generates a query command for an external function. Our major finding is that off-the-shelf LLMs exhibit serious hallucination behaviors in financial tasks; there is therefore an urgent need for research efforts to mitigate them.


Summary

  • The paper reveals that mainstream LLMs generate factually incorrect outputs in financial tasks, including misleading acronym expansions, term explanations, and stock price queries.
  • It shows that domain-specific fine-tuning and multi-task learning can worsen hallucination, with smaller models often underperforming compared to larger counterparts.
  • The study demonstrates that techniques like Retrieval-Augmented Generation and prompt-based tool learning effectively mitigate hallucination by grounding outputs in reliable external data.

This paper empirically investigates the problem of "hallucination" – the generation of plausible but factually incorrect information – by LLMs in the financial domain. The authors argue that hallucination poses significant risks in finance, where accuracy is critical.

The paper evaluates several LLMs, including Llama2 variants (7B, 13B, base, and chat), GPT-3.5-turbo, GPT-4, and a finance-specific model, FinMA-7B, on three distinct financial tasks:

  1. Financial Abbreviation Recognition: Assessing the models' ability to correctly expand financial acronyms (e.g., "TIF" to "Tax Increment Financing") and identify company names from stock symbols (e.g., "AAPL" to "Apple Inc.").
  2. Financial Term Explanations: Evaluating the factuality of explanations generated for less common financial terms, using the FactScore metric against Wikipedia content (a sketch of this scoring follows the list).
  3. Stock Price Query: Testing the models' capability to retrieve accurate historical stock prices for specific tickers on given dates.
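
To make the evaluation in task 2 concrete, a FactScore-style metric splits a generated explanation into atomic facts, checks each fact against reference passages, and reports the supported fraction. The sketch below is deliberately naive: the helper names are hypothetical, and the real metric (Min et al., 2023) uses an LLM both to decompose facts and to judge support against retrieved Wikipedia content.

```python
# Minimal sketch of a FactScore-style evaluation (helper names are hypothetical;
# the real pipeline uses LLM-based fact decomposition and verification).

def split_into_atomic_facts(explanation: str) -> list[str]:
    # Naive stand-in: treat each sentence as one atomic fact.
    return [s.strip() for s in explanation.split(".") if s.strip()]

def is_supported(fact: str, wiki_passages: list[str]) -> bool:
    # Naive stand-in: substring match; in practice an NLI model or LLM judge
    # checks whether a retrieved passage entails the fact.
    return any(fact.lower() in p.lower() for p in wiki_passages)

def factscore(explanation: str, wiki_passages: list[str]) -> float:
    facts = split_into_atomic_facts(explanation)
    if not facts:
        return 0.0
    return sum(is_supported(f, wiki_passages) for f in facts) / len(facts)
```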

Key findings reveal significant hallucination issues:

  • General LLMs Struggle: Even advanced models like GPT-4 showed inaccuracies (e.g., 82.5% accuracy for acronyms, 90.4% for stock symbols, 81.11% FactScore for term explanations), sometimes providing outdated information (like mentioning delisted stocks). Smaller open-source models (Llama2-7B) performed considerably worse.
  • Domain-Specific Fine-tuning Issues: FinMA-7B, despite being fine-tuned on financial tasks, performed worse than its base model (Llama1-7B) on these specific tasks, suggesting that multi-task fine-tuning might degrade general instruction-following and increase certain types of hallucinations.
  • High Unreliability in Price Queries: When asked directly for historical stock prices (without tools), Llama2 models exhibited extremely high mean absolute errors (MAE above $6,000), rendering their outputs useless in practice. The GPT models, commendably, abstained from answering these questions directly.

The paper also evaluates four methods to mitigate hallucination:

  1. Few-Shot Prompting: Providing examples in the prompt.
  2. Decoding by Contrasting Layers (DoLa): A decoding strategy intended to improve factuality (sketched after this list).
  3. Retrieval-Augmented Generation (RAG): Supplementing the LLM with information retrieved from an external source, here Wikipedia indexed with FAISS (sketched after this list).
  4. Prompt-based Tool Learning: Teaching the model to generate correct API calls (specifically for the Alpha Vantage API) to fetch real-time/accurate data.
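
For mitigation method 2, recent versions of Hugging Face transformers (4.44+) ship a DoLa implementation exposed through the `dola_layers` argument of `generate`. A minimal sketch, assuming that API; the prompt and model choice are illustrative:

```python
# Sketch: DoLa decoding via Hugging Face transformers (>= 4.44).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # one of the models evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What does the financial acronym 'TIF' stand for?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# dola_layers="high" contrasts the final layer against upper premature layers,
# the setting recommended for short factual answers.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    dola_layers="high",
    repetition_penalty=1.2,  # DoLa tends to repeat without a mild penalty
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```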
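
For mitigation method 3, the retrieval step can be sketched with a flat FAISS index over embedded passages; the encoder model and the two toy passages below are placeholders for the paper's Wikipedia corpus:

```python
# Sketch: RAG retrieval with FAISS (toy passages stand in for a Wikipedia dump).
import faiss
from sentence_transformers import SentenceTransformer

passages = [
    "Tax increment financing (TIF) is a public financing method used for subsidizing development.",
    "An interest rate swap is a derivative in which two parties exchange interest payments.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode(passages, normalize_embeddings=True)

index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on normalized vectors
index.add(emb)

query = "Explain tax increment financing."
q = encoder.encode([query], normalize_embeddings=True)
_, ids = index.search(q, 1)  # retrieve the top-1 passage

context = "\n".join(passages[i] for i in ids[0])
prompt = f"Answer using only the context.\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` is then sent to the LLM (e.g., Llama2-13B-chat) in place of the bare question.
```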

The effectiveness of these mitigation strategies varied:

  • RAG is Effective: RAG significantly improved accuracy and factuality across tasks (e.g., boosting Llama2-13B-chat's FactScore from 66.72% to 90.67% and acronym accuracy from 75.0% to 93.4%).
  • Tool Learning Excels for Dynamic Data: Prompt-based tool learning dramatically improved performance on the stock price query task. Models learned to generate correct API calls, achieving near-perfect accuracy (e.g., 100% for Llama2-7B-chat+tool and GPT-4+tool with one-shot learning); see the sketch after this list.
  • Few-Shot Prompts Limited: Few-shot learning provided modest gains, primarily improving adherence to the desired output format rather than significantly boosting factual accuracy, especially for chat-tuned models.
  • DoLa Limitations: DoLa showed limited benefit, particularly when the base model lacked the necessary factual knowledge in its pre-training data.
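
To illustrate how one-shot prompting and tool learning combine on the price task, the sketch below teaches a model to emit an Alpha Vantage query URL rather than guess a price, then executes the generated call. `TIME_SERIES_DAILY` is a real Alpha Vantage endpoint, but the prompt wording, the date, and the stubbed `generated` completion are assumptions:

```python
# Sketch: prompt-based tool learning for stock price queries (one-shot prompt).
import requests

ONE_SHOT_PROMPT = """Generate the API call that answers the question.
Q: Get IBM's daily prices.
A: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey=KEY
Q: Get AAPL's daily prices.
A:"""

# In the paper's setup, the LLM completes the prompt with the URL; stubbed here.
generated = ("https://www.alphavantage.co/query"
             "?function=TIME_SERIES_DAILY&symbol=AAPL&apikey=KEY")

# Execute the generated call and read the close price for an (illustrative) date.
data = requests.get(generated.replace("KEY", "<your-api-key>")).json()
print(data["Time Series (Daily)"]["2023-11-24"]["4. close"])
```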

In conclusion, the paper highlights that off-the-shelf LLMs exhibit serious hallucination problems in financial tasks, making them unreliable for direct use where accuracy is paramount. However, techniques like RAG and prompt-based tool learning offer practical and effective ways to mitigate these issues by grounding LLM outputs in external knowledge and real-time data sources, paving the way for more responsible deployment in finance.
