Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination (2311.15548v1)

Published 27 Nov 2023 in cs.CL, cs.AI, cs.LG, and q-fin.ST

Abstract: The hallucination issue is recognized as a fundamental deficiency of LLMs, especially when they are applied to fields such as finance, education, and law. Despite growing concern, empirical investigation has been lacking. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. First, we investigate LLMs' ability to explain financial concepts and terminology. Second, we assess LLMs' capacity to query historical stock prices. Third, to alleviate hallucination, we evaluate the efficacy of four practical methods: few-shot learning, Decoding by Contrasting Layers (DoLa), Retrieval-Augmented Generation (RAG), and prompt-based tool learning, in which the model generates a query command for an external function. Our major finding is that off-the-shelf LLMs exhibit serious hallucination behaviors in financial tasks; there is therefore an urgent need for research efforts to mitigate them.


Summary

  • The paper reveals that mainstream LLMs generate factually incorrect outputs in financial tasks, including misleading acronym expansions, term explanations, and stock price queries.
  • It shows that domain-specific fine-tuning and multi-task learning can worsen hallucination, with smaller models often underperforming compared to larger counterparts.
  • The study demonstrates that techniques like Retrieval-Augmented Generation and prompt-based tool learning effectively mitigate hallucination by grounding outputs in reliable external data.

This paper empirically investigates the problem of "hallucination" – the generation of plausible but factually incorrect information – by LLMs in the financial domain. The authors argue that hallucination poses significant risks in finance, where accuracy is critical.

The paper evaluates several LLMs, including Llama2 variants (7B, 13B, base, and chat), GPT-3.5-turbo, GPT-4, and a finance-specific model, FinMA-7B, on three distinct financial tasks:

  1. Financial Abbreviation Recognition: Assessing the models' ability to correctly expand financial acronyms (e.g., "TIF" to "Tax Increment Financing") and identify company names from stock symbols (e.g., "AAPL" to "Apple Inc.").
  2. Financial Term Explanations: Evaluating the factuality of explanations generated for less common financial terms, using the FactScore metric against Wikipedia content (a sketch of this scoring follows the list).
  3. Stock Price Query: Testing the models' capability to retrieve accurate historical stock prices for specific tickers on given dates.
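
To make the evaluation in task 2 concrete, a FactScore-style metric splits a generated explanation into atomic facts, checks each fact against reference passages, and reports the supported fraction. The sketch below is deliberately naive: the helper names are hypothetical, and the real metric (Min et al., 2023) uses an LLM both to decompose facts and to judge support against retrieved Wikipedia content.

```python
# Minimal sketch of a FactScore-style evaluation (helper names are hypothetical;
# the real pipeline uses LLM-based fact decomposition and verification).

def split_into_atomic_facts(explanation: str) -> list[str]:
    # Naive stand-in: treat each sentence as one atomic fact.
    return [s.strip() for s in explanation.split(".") if s.strip()]

def is_supported(fact: str, wiki_passages: list[str]) -> bool:
    # Naive stand-in: substring match; in practice an NLI model or LLM judge
    # checks whether a retrieved passage entails the fact.
    return any(fact.lower() in p.lower() for p in wiki_passages)

def factscore(explanation: str, wiki_passages: list[str]) -> float:
    facts = split_into_atomic_facts(explanation)
    if not facts:
        return 0.0
    return sum(is_supported(f, wiki_passages) for f in facts) / len(facts)
```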

Key findings reveal significant hallucination issues:

  • General LLMs Struggle: Even advanced models like GPT-4 showed inaccuracies (e.g., 82.5% accuracy for acronyms, 90.4% for stock symbols, 81.11% FactScore for term explanations), sometimes providing outdated information (like mentioning delisted stocks). Smaller open-source models (Llama2-7B) performed considerably worse.
  • Domain-Specific Fine-tuning Issues: FinMA-7B, despite being fine-tuned on financial tasks, performed worse than its base model (Llama1-7B) on these specific tasks, suggesting that multi-task fine-tuning might degrade general instruction-following and increase certain types of hallucinations.
  • High Unreliability in Price Queries: When asked directly for historical stock prices (without tools), Llama2 models exhibited extremely high mean absolute errors (MAE above $6,000), rendering their outputs useless in practice. The GPT models, commendably, abstained from answering these questions directly.

The paper also evaluates four methods to mitigate hallucination:

  1. Few-Shot Prompting: Providing examples in the prompt.
  2. Decoding by Contrasting Layers (DoLa): A decoding strategy intended to improve factuality (sketched after this list).
  3. Retrieval-Augmented Generation (RAG): Supplementing the LLM with information retrieved from an external source, here Wikipedia indexed with FAISS (sketched after this list).
  4. Prompt-based Tool Learning: Teaching the model to generate correct API calls (specifically for the Alpha Vantage API) to fetch real-time/accurate data.
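
For mitigation method 2, recent versions of Hugging Face transformers (4.44+) ship a DoLa implementation exposed through the `dola_layers` argument of `generate`. A minimal sketch, assuming that API; the prompt and model choice are illustrative:

```python
# Sketch: DoLa decoding via Hugging Face transformers (>= 4.44).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # one of the models evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What does the financial acronym 'TIF' stand for?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# dola_layers="high" contrasts the final layer against upper premature layers,
# the setting recommended for short factual answers.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    dola_layers="high",
    repetition_penalty=1.2,  # DoLa tends to repeat without a mild penalty
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```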
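
For mitigation method 3, the retrieval step can be sketched with a flat FAISS index over embedded passages; the encoder model and the two toy passages below are placeholders for the paper's Wikipedia corpus:

```python
# Sketch: RAG retrieval with FAISS (toy passages stand in for a Wikipedia dump).
import faiss
from sentence_transformers import SentenceTransformer

passages = [
    "Tax increment financing (TIF) is a public financing method used for subsidizing development.",
    "An interest rate swap is a derivative in which two parties exchange interest payments.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode(passages, normalize_embeddings=True)

index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on normalized vectors
index.add(emb)

query = "Explain tax increment financing."
q = encoder.encode([query], normalize_embeddings=True)
_, ids = index.search(q, 1)  # retrieve the top-1 passage

context = "\n".join(passages[i] for i in ids[0])
prompt = f"Answer using only the context.\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` is then sent to the LLM (e.g., Llama2-13B-chat) in place of the bare question.
```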

The effectiveness of these mitigation strategies varied:

  • RAG is Effective: RAG significantly improved accuracy and factuality across tasks (e.g., boosting Llama2-13B-chat's FactScore from 66.72% to 90.67% and acronym accuracy from 75.0% to 93.4%).
  • Tool Learning Excels for Dynamic Data: Prompt-based tool learning dramatically improved performance on the stock price query task. Models learned to generate correct API calls, achieving near-perfect accuracy (e.g., 100% for Llama2-7B-chat+tool and GPT-4+tool with one-shot learning); see the sketch after this list.
  • Few-Shot Prompts Limited: Few-shot learning provided modest gains, primarily improving adherence to the desired output format rather than significantly boosting factual accuracy, especially for chat-tuned models.
  • DoLa Limitations: DoLa showed limited benefit, particularly when the base model lacked the necessary factual knowledge in its pre-training data.
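
To illustrate how one-shot prompting and tool learning combine on the price task, the sketch below teaches a model to emit an Alpha Vantage query URL rather than guess a price, then executes the generated call. `TIME_SERIES_DAILY` is a real Alpha Vantage endpoint, but the prompt wording, the date, and the stubbed `generated` completion are assumptions:

```python
# Sketch: prompt-based tool learning for stock price queries (one-shot prompt).
import requests

ONE_SHOT_PROMPT = """Generate the API call that answers the question.
Q: Get IBM's daily prices.
A: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey=KEY
Q: Get AAPL's daily prices.
A:"""

# In the paper's setup, the LLM completes the prompt with the URL; stubbed here.
generated = ("https://www.alphavantage.co/query"
             "?function=TIME_SERIES_DAILY&symbol=AAPL&apikey=KEY")

# Execute the generated call and read the close price for an (illustrative) date.
data = requests.get(generated.replace("KEY", "<your-api-key>")).json()
print(data["Time Series (Daily)"]["2023-11-24"]["4. close"])
```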

In conclusion, the paper highlights that off-the-shelf LLMs exhibit serious hallucination problems in financial tasks, making them unreliable for direct use where accuracy is paramount. However, techniques like RAG and prompt-based tool learning offer practical and effective ways to mitigate these issues by grounding LLM outputs in external knowledge and real-time data sources, paving the way for more responsible deployment in finance.
