Numerical Reasoning for Financial Reports (2312.14870v1)

Published 22 Dec 2023 in cs.CL

Abstract: Financial reports offer critical insights into a company's operations, yet their extensive length, typically spanning 30-40 pages, poses challenges for swift decision-making in dynamic markets. To address this, we leveraged fine-tuned LLMs to distill key indicators and operational metrics from these reports based on questions from the user. We devised a method to locate critical data, and leveraged the FinQA dataset to fine-tune both Llama-2 7B and T5 models for customized question answering. We achieved results comparable to the baseline on the final numerical answer, with competitive accuracy in numerical reasoning and calculation.

References (18)
  1. Conciseness, financial disclosure, and market reaction: A textual analysis of annual reports in listed Chinese companies. International Journal of Financial Studies, 10(4), 2022. ISSN 2227-7072. doi: 10.3390/ijfs10040104. URL https://www.mdpi.com/2227-7072/10/4/104.
  2. Dogu Araci. FinBERT: Financial sentiment analysis with pre-trained language models, 2019.
  3. Benchmarking LLM-powered chatbots: Methods and metrics, 2023.
  4. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1026–1036, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.91. URL https://aclanthology.org/2020.findings-emnlp.91.
  5. FinQA: A dataset of numerical reasoning over financial data, 2022.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  7. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
  8. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019.
  9. PAL: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
  10. Context-NER: Contextual phrase generation at scale. arXiv preprint arXiv:2109.08079, 2021.
  11. Instruction tuned models are quick learners. arXiv preprint arXiv:2306.05539, 2023a.
  12. "John is 50 years old, can his son be 65?" Evaluating NLP models' understanding of feasibility. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 407–417, 2023b.
  13. TabLLM: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pp. 5549–5581. PMLR, 2023.
  14. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  15. RealTime QA: What's the answer right now?, 2022.
  16. MAWPS: A math word problem repository. In Kevin Knight, Ani Nenkova, and Owen Rambow (eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136.
  17. Financial reporting quality on investors' decisions. International Journal of Economics and Financial Research, 2:140–147, 08 2016.
  18. Transformer-based models for question answering on COVID-19, 2021.

Summary

  • The paper introduces a pipeline that leverages fine-tuned LLMs to extract tables and perform numerical reasoning on extensive financial reports.
  • The paper contrasts naive and LLM-based table serialization, with the latter achieving higher ROUGE scores and more accurate semantic conversion.
  • The paper integrates FAISS for context search and a post-processing calculator to significantly improve the precision of final numerical answers.

Analyzing large financial reports, often 30-40 pages long, presents a significant challenge for timely decision-making. The research presented in the paper "Numerical Reasoning for Financial Reports" (2312.14870) addresses this by proposing a pipeline that leverages fine-tuned LLMs to extract key numerical indicators and perform reasoning based on user questions. The core focus is twofold: enhancing numerical reasoning for financial reports, particularly interpreting and deriving insights from tabulated data, and building an end-to-end system that extracts information from PDF reports and generates insights from it.

The proposed system operates through a sequence of practical steps (illustrative code sketches for each step follow the list):

  1. PDF Parsing and Table Extraction: The initial step involves extracting tables from the PDF reports. The authors explored various OCR-based techniques, finding PyTabula to be effective. However, a key implementation challenge noted was the tool's difficulty in handling atypical table formats, such as those with complex subheadings or subcolumns, and its tendency to misinterpret infographics as tables. This highlights the need for robust pre-processing or more advanced table detection methods in a real-world application.
  2. Table to Text Serialization: Converting the extracted structured table data into a sequential text format is crucial for LLM processing. The paper contrasts two serialization approaches:
    • Naive Serialization: Incorporates header information to guide the structure. While simple, it struggles with complex multi-index tables.
    • LLM Serialization: Uses another LLM (initially GPT-3.5, later replicated with Llama-2-7b-chat) to generate a more semantically relevant text representation of the table. Experiments with Llama-2-7b-chat, benchmarked against GPT-3.5, showed that few-shot prompting yields better ROUGE scores than zero-shot (RougeL 0.4353 vs. 0.3265), indicating a more accurate and meaningful conversion, which is vital for downstream LLM performance. Implementing this requires integrating an LLM API or a local LLM for the serialization step, adding computational overhead.
  3. Context Search: To manage the context length limitations of LLMs and improve focus, larger serialized texts are segmented into smaller chunks. The system uses the FAISS library, a dense vector search library, to find the most relevant text chunk(s) for a given question by computing similarity scores between the question embedding and the chunk embeddings. The chunk(s) with the highest score are selected as context. This involves generating embeddings for all chunks and indexing them in FAISS, which requires memory and computational resources, especially for very large document sets.
  4. Numerical Question Answering with LLMs: The core reasoning is performed by fine-tuned LLMs, given the original question and the selected context. The paper experimented with T5 (an encoder-decoder model) and Llama-2-7b/Llama-2-7b-chat (decoder-only models). For the Llama models, fine-tuning used the QLoRA technique to improve computational efficiency. The FinQA dataset, containing financial reports, questions, and annotated numerical reasoning programs, was used for training and evaluation.
    • Implementation Details: Fine-tuning T5 involved using Adam optimizer with a learning rate of 0.0001, batch size 8, for 12 epochs. For Llama-2-7b-chat, Quantized LoRA fine-tuning was applied with 4-bit quantization, 16-bit floating-point computation, rank 16, and projection dimension 64, trained with SFTTrainer, batch size 4 on an A100 GPU. Few-shot prompting was also explored as an alternative to fine-tuning.
  5. Post Processing: Since LLMs can output text that describes a calculation without performing it accurately, a post-processing step is crucial. The authors structured the input prompts during fine-tuning and few-shot learning so that the operators and numerical arguments can be extracted with regular expressions. These extracted components are then passed to a separate calculator (e.g., a Python interpreter) to compute the final numerical answer. This addresses the LLM's weakness in precise arithmetic: the results show significantly higher accuracy when a calculator is applied to the extracted arguments and operators than when relying solely on the model's own computed result. This step is a practical workaround that leverages the LLM's understanding of the problem and the relevant numbers while offloading the exact computation.
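The sketches below illustrate each of these steps under stated assumptions; none of them is the authors' exact implementation. For step 1, table extraction might look like the following, assuming the tabula-py package (the paper refers to the tool as PyTabula):

```python
# Step 1 sketch: extract candidate tables from a PDF report.
# Assumes the tabula-py package; "annual_report.pdf" is a placeholder.
import tabula

# Returns one pandas DataFrame per detected table. As noted above,
# complex subheadings and infographics can still confuse detection.
tables = tabula.read_pdf("annual_report.pdf", pages="all", multiple_tables=True)
for df in tables:
    print(df.head())
```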
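For step 2, a minimal sketch of naive serialization, rendering each cell as a header-guided clause; the exact template is an assumption, not the paper's format:

```python
# Step 2 sketch: naive table-to-text serialization.
def naive_serialize(headers, rows):
    """headers: column names, first entry labels the row key.
    rows: lists of cells, first cell is the row label."""
    clauses = []
    for row in rows:
        label, values = row[0], row[1:]
        for header, value in zip(headers[1:], values):
            clauses.append(f"the {header} of {label} is {value}")
    return "; ".join(clauses) + "."

print(naive_serialize(
    ["item", "2021", "2022"],
    [["net revenue", "$1,200M", "$1,450M"],
     ["operating income", "$310M", "$365M"]],
))
```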
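For step 3, a sketch of the FAISS context search; the embedding model and chunk contents here are illustrative assumptions:

```python
# Step 3 sketch: select the most relevant serialized chunk for a question.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

chunks = [
    "the net revenue of 2021 is $1,200M; the net revenue of 2022 is $1,450M.",
    "the operating income of 2021 is $310M; the operating income of 2022 is $365M.",
]
emb = np.asarray(encoder.encode(chunks, normalize_embeddings=True), dtype="float32")

index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on normalized vectors
index.add(emb)

question = "What was the change in net revenue from 2021 to 2022?"
q = np.asarray(encoder.encode([question], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(q, 1)
context = chunks[ids[0][0]]  # highest-scoring chunk becomes the QA context
```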
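For step 4, a sketch of the reported QLoRA configuration (4-bit base weights, 16-bit compute, rank 16) using the transformers/peft stack; mapping "projection dimension 64" to lora_alpha is an assumption:

```python
# Step 4 sketch: QLoRA setup for Llama-2-7b-chat, per the reported settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantized base weights
    bnb_4bit_compute_dtype=torch.float16,  # 16-bit floating-point compute
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16,           # LoRA rank, as reported
    lora_alpha=64,  # assumed reading of "projection dimension 64"
    task_type="CAUSAL_LM",
)
# Adapters are then trained with trl's SFTTrainer on FinQA-style
# (question, context, program) examples at batch size 4.
```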
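For step 5, a sketch of the regex-plus-calculator post-processing; the output template "operation: op(a, b)" is an illustrative assumption about the structured prompt format:

```python
# Step 5 sketch: extract operator and arguments, then compute in Python.
import re

OPS = {"add": lambda a, b: a + b, "subtract": lambda a, b: a - b,
       "multiply": lambda a, b: a * b, "divide": lambda a, b: a / b}

def compute_answer(model_output: str) -> float:
    m = re.search(r"(add|subtract|multiply|divide)\(\s*([-\d.,]+)\s*,\s*([-\d.,]+)\s*\)",
                  model_output)
    if m is None:
        raise ValueError("no operator/arguments found in model output")
    num = lambda s: float(s.replace(",", ""))  # strip thousands separators
    return OPS[m.group(1)](num(m.group(2)), num(m.group(3)))

print(compute_answer("operation: subtract(1450, 1200)"))  # -> 250.0
```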

Experimental Results and Practical Implications:

  • T5 with Naive Serialization: Showed high accuracy (94.9%) in identifying numerical operators but lower accuracy (around 65%) for extracting arguments. Crucially, the model's inherent calculation accuracy was poor (11.8%), but using a post-processing calculator boosted the final result accuracy to over 60% (allowing for 10% deviation; a sketch of this tolerance metric follows the list). This highlights that for numerical tasks, pairing an LLM with an external calculator is often more effective than relying on the LLM alone. T5 also showed a tendency to predict unnecessary steps.
  • Llama with LLM Serialization: Compared few-shot prompting and QLoRA fine-tuning on Llama-2-7b and Llama-2-7b-chat models. QLoRA fine-tuning, particularly on the Llama-2-7b-chat model, consistently showed better performance across metrics (Exact Match for arguments, operators, and result; lower result deviation; higher RougeL scores). For Llama-2-7b-chat (fine-tuned with QLoRA), Exact Match scores were 51.40% for Argument 1, 52.70% for Argument 2, 88.60% for Operator, and 20.00% for the result calculated by the model itself. Using the post-processing calculator significantly improved the accuracy of the final numerical answer; although accuracy percentages for the calculator-computed result were not reported in the same format as the T5 results, the mean-deviation analysis in Table 6 confirms the improvement. These results suggest that fine-tuning helps the model identify the correct arguments and operators from complex text and tables, which in turn lets the post-processing calculator derive a more accurate final answer.
  • Trade-offs: The choice between Naive and LLM serialization involves a trade-off between implementation complexity/cost and semantic accuracy. Using an external LLM for serialization adds complexity and potentially cost but improves the input quality for the downstream QA model. Similarly, using a post-processing calculator adds a step but drastically improves the reliability of the final numerical answer compared to the LLM's inherent calculation ability. Fine-tuning LLMs with QLoRA offers a way to improve performance while managing computational requirements compared to full fine-tuning.
  • Limitations: The observed mean deviations in Llama results, even with the calculator, were sometimes substantial, particularly with large numbers. The authors attribute this partly to potential deficiencies in the input dataset, suggesting that data quality and handling of large numerical values remain practical challenges. Common discrepancies included incorrect argument extraction, operator errors, and issues with handling large numbers.
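A minimal sketch of the tolerance-based result metric referenced above, assuming a relative-deviation definition (the paper's exact formula is not stated here):

```python
# Sketch: a prediction counts as correct if it falls within a relative
# deviation (e.g. 10%) of the gold answer.
def within_tolerance(pred: float, gold: float, tol: float = 0.10) -> bool:
    if gold == 0:
        return abs(pred) <= tol
    return abs(pred - gold) / abs(gold) <= tol

preds, golds = [250.0, 248.0, 300.0], [250.0, 250.0, 250.0]
acc = sum(within_tolerance(p, g) for p, g in zip(preds, golds)) / len(preds)
print(f"accuracy@10%: {acc:.2f}")  # -> 0.67
```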

Implementation Considerations:

  • Deploying this system requires integrating components for PDF parsing (potentially involving OCR and table structure recognition), table serialization (either rule-based or LLM-based), text chunking, embedding generation, FAISS indexing and search, an LLM for QA (fine-tuned with QLoRA for efficiency), and a post-processing logic with a calculator.
  • Computational resources for embedding generation, FAISS indexing, and running the LLM (even with QLoRA) need to be considered, especially for processing a large volume of reports.
  • Handling diverse and complex table structures in real-world financial reports is a critical challenge that requires robust table extraction and serialization methods.
  • The performance on large numerical values might need specific attention, possibly requiring additional techniques or data augmentation focused on large numbers.
  • Integrating Program-Aided LLMs (PALs), as suggested for future work, could potentially streamline the process by allowing the LLM to generate code snippets for calculation directly.

In summary, the research provides a practical blueprint for building a numerical reasoning system for financial reports by combining PDF processing, structured data serialization, semantic search, fine-tuned LLMs, and external calculation. While demonstrating promising results, it also highlights critical implementation challenges related to table parsing, serialization accuracy, LLM numerical precision, and the handling of large numbers, guiding future development efforts for real-world financial analysis applications.