DocFinQA: A Long-Context Financial Reasoning Dataset (2401.06915v2)
Abstract: For LLMs to be effective in the financial domain -- where each decision can have a significant impact -- it is necessary to investigate realistic tasks and data. Financial professionals often work with documents that are hundreds of pages long, but most financial research datasets cover only short excerpts of these documents. To address this, we introduce a long-document financial QA task. We augment 7,437 questions from the existing FinQA dataset with full-document context, extending the average context length from under 700 words in FinQA to 123k words in DocFinQA. We conduct extensive experiments over retrieval-based QA pipelines and long-context LLMs. DocFinQA proves to be a significant challenge even for state-of-the-art systems. We also provide a case study on the longest documents in DocFinQA and find that models struggle particularly on these documents. Addressing these challenges may have a wide-reaching impact across applications where specificity and long-range context are critical, such as gene sequences and legal contract analysis.
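The abstract mentions retrieval-based QA pipelines over full filings. As a minimal sketch (not the paper's implementation), such a pipeline typically splits a long document into overlapping chunks, scores each chunk against the question with a lexical ranker such as BM25, and passes only the top-scoring chunks to the model. The chunk sizes and BM25 parameters below are illustrative assumptions:

```python
import math
import re
from collections import Counter


def chunk_words(text, size=200, overlap=50):
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with standard BM25."""
    docs = [tokenize(c) for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def retrieve(query, document, top_k=3, chunk_size=200, overlap=50):
    """Return the top_k chunks most relevant to the query."""
    chunks = chunk_words(document, size=chunk_size, overlap=overlap)
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: -scores[i])
    return [chunks[i] for i in ranked[:top_k]]
```

The retrieved chunks would then be concatenated into the prompt of a QA model in place of the full 123k-word document; a dense retriever (e.g. sentence embeddings) could be swapped in for BM25 without changing the pipeline's shape.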
Authors: Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Chris Tanner, Michael Krumdick, Charles Lovering