FinTextQA: A Dataset for Long-form Financial Question Answering (2405.09980v1)

Published 16 May 2024 in cs.CL and cs.AI

Abstract: Accurate evaluation of financial question answering (QA) systems necessitates a comprehensive dataset encompassing diverse question types and contexts. However, current financial QA datasets lack scope diversity and question complexity. This work introduces FinTextQA, a novel dataset for long-form question answering (LFQA) in finance. FinTextQA comprises 1,262 high-quality, source-attributed QA pairs extracted and selected from finance textbooks and government agency websites. Moreover, we developed a Retrieval-Augmented Generation (RAG)-based LFQA system, comprising an embedder, retriever, reranker, and generator. A multi-faceted evaluation approach, including human ranking, automatic metrics, and GPT-4 scoring, was employed to benchmark the performance of different LFQA system configurations under heightened noise conditions. The results indicate that: (1) Among all compared generators, Baichuan2-7B competes closely with GPT-3.5-turbo in accuracy score; (2) The most effective system configuration on our dataset sets the embedder, retriever, reranker, and generator to Ada2, Automated Merged Retrieval, Bge-Reranker-Base, and Baichuan2-7B, respectively; (3) Models become less susceptible to noise once the context length exceeds a certain threshold.
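The RAG-based LFQA system described in the abstract is a four-stage pipeline: an embedder encodes the question and candidate passages, a retriever selects the most similar passages, a reranker reorders the candidates by relevance, and a generator produces the long-form answer conditioned on the retained context. The sketch below illustrates only this data flow; the toy bag-of-words embedder, token-overlap reranker, and stub generator are illustrative assumptions, not the Ada2 / Automated Merged Retrieval / Bge-Reranker-Base / Baichuan2-7B configuration benchmarked in the paper.

```python
# Minimal sketch of a four-stage RAG pipeline (embedder -> retriever ->
# reranker -> generator). All components here are toy stand-ins for
# illustration, not the models evaluated in FinTextQA.
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Embedder: toy bag-of-words vector (real systems use dense encoders)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Retriever: rank passages by embedding similarity and keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def rerank(query: str, passages: list[str]) -> list[str]:
    """Reranker: reorder candidates with a (here trivial) query-overlap score."""
    q_terms = set(query.lower().split())
    return sorted(passages, key=lambda p: len(q_terms & set(p.lower().split())), reverse=True)


def generate(query: str, passages: list[str]) -> str:
    """Generator: an LLM would condition on the query plus the reranked context."""
    context = " ".join(passages)
    return f"Answer to '{query}' grounded in: {context[:120]}..."


if __name__ == "__main__":
    corpus = [
        "A bond's coupon rate is the annual interest paid relative to face value.",
        "Diversification reduces unsystematic risk in a portfolio.",
        "Central banks adjust policy rates to manage inflation.",
    ]
    question = "What does a bond's coupon rate measure?"
    candidates = retrieve(question, corpus, k=2)
    ordered = rerank(question, candidates)
    print(generate(question, ordered))
```

In the paper's setting, the retrieved and reranked context can also be padded with distractor passages to study how answer quality degrades under the heightened-noise conditions mentioned above.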

Authors (8)
  1. Jian Chen (257 papers)
  2. Peilin Zhou (34 papers)
  3. Yining Hua (23 papers)
  4. Yingxin Loh (1 paper)
  5. Kehui Chen (6 papers)
  6. Ziyuan Li (32 papers)
  7. Bing Zhu (53 papers)
  8. Junwei Liang (47 papers)
Citations (2)