Expect the Unexpected: FailSafe Long Context QA for Finance (2502.06329v1)

Published 10 Feb 2025 in cs.CL

Abstract: We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA

Summary

  • The paper introduces FailSafeQA, a benchmark and evaluation framework that measures LLM robustness to query perturbations and context-grounding ability in finance.
  • Evaluation of 24 models reveals a significant trade-off between maintaining answer accuracy under benign query perturbations and avoiding hallucinations when context is degraded or missing.
  • Results show models struggle with context grounding, fabricating information in up to 41% of scenarios with poor context, highlighting the need for balanced design strategies.

The paper introduces a long-context benchmark for the financial domain that systematically stresses the resilience and reliability of LLMs under realistic input perturbations. The framework, FailSafeQA, poses a dual challenge: an LLM must provide robust answers in the face of query perturbations while guarding against hallucinations when the supporting context is degraded or absent.

The methodology comprises several key components:

  • Dataset Construction:
    • The dataset is derived from SEC 10-K filings, with contexts truncated to maintain coherent paragraphs and kept under a 25k token limit.
    • A semi-automated pipeline is used, involving multi-turn query generation, filtering, and rewriting steps to produce clear, standalone questions.
    • Further processing introduces perturbations at the query level (Misspelled, Incomplete, and Out-of-Domain queries) and at the context level (missing context, simulated OCR errors, and irrelevant context); a toy sketch of such perturbations appears after this list.
    • The final dataset contains 220 examples; for each example, multiple variants of the question are paired with the corresponding context and ground-truth citations.
  • Evaluation Metrics:
    • Answer Relevance and Compliance: The paper defines a categorical scale (1–6) to assess whether model responses are factually correct and comprehensive. A binary mapping, $c_{\geq 4}$, allows averaging across examples to yield an overall compliance ratio (the metrics are sketched in code after this list).
    • LLM Robustness ($R$): Defined as $R = \frac{1}{n}\sum_{i=1}^{n} \min_{j} c_{\geq 4}(\text{model}(T_{j}(x_{i})), y_{i})$, where $T_j$ is the identity or one of the perturbation transformations. Taking the minimum over $j$ credits a model only when it remains compliant on every variant of a query, so the metric captures performance drops across varied query and context modifications.
    • Context Grounding ($G$): Measures the model's ability to refrain from hallucinating when critical context is missing or irrelevant, by averaging compliance over the two context-failure scenarios: $G = \frac{1}{2n}\sum_{j=1}^{2}\sum_{i=1}^{n} c_{\geq 4}(\text{model}(T_j(x_{i})), Y)$.
    • LLM Compliance Score ($\text{LLMC}_\beta$): Inspired by the precision-recall trade-off, this metric combines Robustness and Context Grounding, with a tunable factor $\beta$ that prioritizes safe refusals over possibly hallucinated facts (a hedged $F_\beta$-style reading is sketched after this list).
  • Experimental Evaluation:
    • A wide array of 24 models is evaluated, spanning open-source families (DeepSeek, Meta Llama variants, Qwen, the Phi series) and proprietary APIs (OpenAI GPT-4, o1/o3-mini, Gemini models, etc.).
    • Results indicate that while some models (e.g., OpenAI o3-mini) maintain high robustness across query perturbations (with baseline robustness approaching 0.98 but dropping by up to 0.08 under perturbations), they exhibit a tendency to fabricate information in challenging context grounding tasks.
    • Notably, Palmyra-Fin-128k-Instruct, which is highlighted as the most compliant model, demonstrated robust baseline performance yet failed to maintain robust predictions in 17% of test cases. In contrast, reasoning models, including OpenAI o3-mini and DeepSeek variants, fabricated information in up to 41% of the scenarios under conditions that mimic missing or irrelevant context.
  • Insights and Trade-offs:
    • The experimental analysis reveals a significant trade-off between robustness (maintaining answer accuracy under benign perturbations) and context grounding (refraining from hallucinating when the context is degraded).
    • Text-generation tasks (e.g., blog-post-style queries) are markedly more susceptible to hallucination than straightforward question-answer scenarios.
    • The paper therefore advocates multi-stage approaches for context-dependent content generation, such as extracting the correct information with dedicated QA queries before composing extended content (a minimal pipeline sketch appears after this list).
  • Reproducibility and Future Work:
    • The benchmark, along with all prompts and inference parameters, is made available publicly, facilitating reproducibility and further research on long-context financial LLM applications.
    • The authors acknowledge the scope limitations inherent in focusing on financial sentiment and long-context processing, while indicating that the framework can be readily adapted to other domains and context lengths.
    • Moreover, the negative correlation observed between Robustness and Context Grounding prompts further investigation into design strategies for balancing resilient responses with safe refusals in high-stakes applications.
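
To make the perturbation taxonomy concrete, the toy helpers below illustrate what Misspelled and Incomplete query perturbations and simulated OCR noise can look like. These functions are illustrative assumptions, not the paper's semi-automated, LLM-driven generation pipeline:

```python
import random

def misspell(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Toy 'Misspelled query' perturbation: swap adjacent characters in a
    random fraction of the words."""
    rng = random.Random(seed)
    words = query.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def incomplete(query: str, keep: float = 0.6) -> str:
    """Toy 'Incomplete query' perturbation: keep only a leading fraction of
    the words, as if the user stopped typing mid-question."""
    words = query.split()
    return " ".join(words[: max(1, int(len(words) * keep))])

def ocr_noise(context: str, rate: float = 0.02, seed: int = 0) -> str:
    """Toy OCR-error simulation: substitute visually confusable characters
    in the uploaded document text."""
    confusions = {"l": "1", "O": "0", "e": "c", "m": "rn"}
    rng = random.Random(seed)
    return "".join(
        confusions[ch] if ch in confusions and rng.random() < rate else ch
        for ch in context
    )
```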
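
The paper's exact $\text{LLMC}_\beta$ formula is not reproduced in this summary, so the following is an assumption: if the score follows the standard $F_\beta$ form that the precision-recall analogy suggests, it would read $\text{LLMC}_\beta = (1+\beta^2)\,\frac{R \cdot G}{\beta^2 R + G}$, so that $\beta > 1$ shifts weight toward Context Grounding, i.e., toward safe refusals. Under that assumption, all three scores can be computed from per-example judge ratings with the short sketch below; the array shapes and the `llmc` helper are illustrative, not the paper's reference implementation:

```python
import numpy as np

def robustness(ratings: np.ndarray) -> float:
    """ratings: (n_examples, n_variants) judge scores on the 1-6 scale, where
    column 0 is the unperturbed query and the rest are perturbations. The
    binary mapping c_{>=4} marks a response compliant; the min over variants
    credits a model only if every variant of an example is compliant."""
    compliant = (ratings >= 4).astype(int)
    return float(compliant.min(axis=1).mean())

def context_grounding(ratings: np.ndarray) -> float:
    """ratings: (n_examples, 2) judge scores for the two context-failure
    scenarios; compliance here means refusing rather than hallucinating."""
    return float((ratings >= 4).mean())

def llmc(r: float, g: float, beta: float = 1.0) -> float:
    """Assumed F_beta-style combination: beta > 1 favors Context Grounding
    (safe refusals) over Robustness."""
    if r == 0.0 and g == 0.0:
        return 0.0
    return (1 + beta**2) * r * g / (beta**2 * r + g)

# Toy usage with fabricated judge ratings for two examples.
query_ratings = np.array([[6, 5, 4, 5],
                          [5, 3, 6, 6]])   # example 2 fails one perturbation
ctx_ratings = np.array([[6, 5],
                        [4, 2]])
R = robustness(query_ratings)       # 0.5
G = context_grounding(ctx_ratings)  # 0.75
print(R, G, llmc(R, G, beta=1.1))
```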

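The multi-stage recommendation can also be made concrete. The sketch below is one plausible realization under stated assumptions: `ask_llm` is a hypothetical stand-in for whatever chat-completion client is in use, and the two stages (narrow QA extraction with explicit refusal, then composition from verified facts only) follow the paper's suggestion of extracting correct information before writing extended content:

```python
from typing import Callable

def generate_grounded_post(
    ask_llm: Callable[[str], str],   # hypothetical LLM client wrapper
    context: str,
    facts_needed: list[str],
) -> str:
    """Two-stage, context-grounded generation: extract each required fact
    with a narrow QA prompt that may refuse, then compose long-form output
    from the verified extractions only."""
    extracted = {}
    for question in facts_needed:
        answer = ask_llm(
            "Using ONLY the document below, answer the question. If the "
            "document does not contain the answer, reply 'NOT FOUND'.\n\n"
            f"Document:\n{context}\n\nQuestion: {question}"
        )
        if answer.strip() != "NOT FOUND":
            extracted[question] = answer
    # The drafting step never sees the raw (possibly degraded or irrelevant)
    # context, which limits the surface for hallucination.
    facts = "\n".join(f"- {q}: {a}" for q, a in extracted.items())
    return ask_llm(
        "Write a short blog-style summary using ONLY these verified facts:\n"
        + facts
    )
```
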
Overall, the paper provides a comprehensive evaluation framework that quantifies both the ability of LLMs to withstand semantic perturbations in user queries and to avoid hallucinations when deprived of reliable context. The quantitative evaluation—in particular, the introduction of the LLM Compliance Score—offers nuanced insights into the reliability of LLMs for financial applications, where factual accuracy is critical.
