FinTextQA: A Dataset for Long-form Financial Question Answering (2405.09980v1)
Abstract: Accurate evaluation of financial question answering (QA) systems requires a comprehensive dataset encompassing diverse question types and contexts. However, current financial QA datasets lack scope diversity and question complexity. This work introduces FinTextQA, a novel dataset for long-form question answering (LFQA) in finance. FinTextQA comprises 1,262 high-quality, source-attributed QA pairs extracted and selected from finance textbooks and government agency websites. Moreover, we developed a Retrieval-Augmented Generation (RAG)-based LFQA system comprising an embedder, retriever, reranker, and generator. A multi-faceted evaluation approach, including human ranking, automatic metrics, and GPT-4 scoring, was employed to benchmark the performance of different LFQA system configurations under heightened noise conditions. The results indicate that: (1) among all compared generators, Baichuan2-7B competes closely with GPT-3.5-turbo in accuracy; (2) the most effective system configuration on our dataset sets the embedder, retriever, reranker, and generator to Ada2, Automated Merged Retrieval, Bge-Reranker-Base, and Baichuan2-7B, respectively; (3) models become less susceptible to noise once context length reaches a specific threshold.
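The four-stage pipeline described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: every component here is a hypothetical stand-in (a bag-of-words embedder for Ada2, a cosine-similarity retriever for Automated Merged Retrieval, a term-overlap reranker for Bge-Reranker-Base, and an echo stub in place of the Baichuan2-7B generator).

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9\-]+", text.lower())

def embed(text: str) -> Counter:
    # Toy embedder: bag-of-words term counts (stand-in for Ada2 vectors).
    return Counter(tokenize(text))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Toy retriever: top-k passages by embedding similarity
    # (stand-in for Automated Merged Retrieval).
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rerank(query: str, passages: list[str]) -> list[str]:
    # Toy reranker: reorder candidates by exact query-term overlap
    # (stand-in for a cross-encoder such as Bge-Reranker-Base).
    q_terms = set(tokenize(query))
    return sorted(passages,
                  key=lambda p: len(q_terms & set(tokenize(p))),
                  reverse=True)

def generate(query: str, passages: list[str]) -> str:
    # Toy generator: the real system conditions an LLM on the reranked
    # context; here we just surface the top passage alongside the query.
    return f"Q: {query}\nContext: {passages[0]}"

corpus = [
    "A bond is a fixed-income instrument representing a loan to a borrower.",
    "Equity represents ownership in a company.",
    "Inflation is the rate at which prices rise over time.",
]
query = "What is a bond?"
answer = generate(query, rerank(query, retrieve(query, corpus)))
```

The point of the sketch is the data flow: the retriever narrows the corpus cheaply, the reranker reorders the survivors with a more precise (and more expensive) score, and only then does the generator see the context.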
Authors: Jian Chen, Peilin Zhou, Yining Hua, Yingxin Loh, Kehui Chen, Ziyuan Li, Bing Zhu, Junwei Liang