Answerability in QA Systems
- The Answerability Challenge is defined as determining whether a given context contains sufficient evidence to answer a query, clearly distinguishing evidence support from topical relevance.
- It spans multiple domains—including text, image, video, code, and databases—and highlights challenges like retrieval misses, semantic overlaps, and modality-specific failures.
- Recent methods enhance answerability detection by employing improved supervision, refined architectures, and latent signal analysis to reduce hallucination and improve response quality.
The Answerability Challenge is the problem of determining whether a query can be answered from a specified evidence source and, if not, preventing unsupported response generation. In contemporary research, the problem is formulated with respect to retrieved passages in open-domain question answering, images in visual question answering, instructional videos, API descriptions in retrieval-augmented code generation, enterprise databases, and schema-conditioned SQL generation. Across these settings, answerability is defined relative to the available context rather than to world knowledge in the abstract, and it is repeatedly distinguished from topical relevance: a context may be on-topic yet still fail to contain sufficient evidence for a correct answer (Abdumalikov et al., 2024, Łajewska et al., 2024, Gurari et al., 2018, Kim et al., 2024, Mandal et al., 27 Jan 2025).
1. Conceptual foundations
A formal notion of answerability predates current LLM-based systems. In the setting of result-bounded data interfaces, answerability asks whether there exists a plan that returns all answers to a query despite interfaces that may provide only bounded subsets of matching tuples; the analysis proceeds through reductions to query containment with constraints, schema simplification theorems, and linearization methods (Amarilli et al., 2017). In that tradition, answerability is not a property of a question alone, but of a question under access restrictions and integrity constraints.
Modern QA research adopts the same contextual view in probabilistic and neural systems. In conversational information seeking, a question is treated as answerable if the answer is present in the retrieved evidence and unanswerable otherwise; the distinction is explicitly not general relevance but the presence of the specific information needed to answer the question (Łajewska et al., 2024). In retrieval-augmented ODQA, the emphasis is on type I unanswerability, where retrieved context does not contain a sufficient answer, often because retrieval is inaccurate or the corpus lacks the needed information (Abdumalikov et al., 2024). In information-seeking QA over Natural Questions and TyDi QA, answerability is defined by whether the paired evidence document contains an answer, but the literature stresses that many unanswerable cases arise naturally from retrieval miss, incomplete evidence, ambiguity, false premises, or task-format restrictions rather than from adversarial construction (Asai et al., 2020).
A parallel but distinct use of the term appears in automatic question generation. There, answerability refers to whether a generated question contains enough of the right information for a human or downstream QA system to infer the intended answer. The proposed scoring function decomposes this into question type, named entities, relevant content words, and function words, and combines their weighted precision and recall into an answerability score that can be blended with BLEU-style metrics (Nema et al., 2018). This broader usage shows that answerability can refer either to support in the evidence or to adequacy of the query itself.
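The decomposition described above can be sketched in a few lines. This is a minimal illustration, not the scoring function as published: the harmonic-mean combination, the equal weights, the blending parameter `delta`, and all function names are assumptions made for the example, and the tokens are assumed to be pre-sorted into the four categories.

```python
from collections import Counter

def overlap_count(hyp_tokens, ref_tokens):
    """Number of tokens shared by two token lists, counted with multiplicity."""
    return sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())

def answerability_score(hyp, ref, weights):
    """Weighted precision/recall over token categories, combined as an F1-style mean.

    hyp and ref map a category name to the list of tokens of that category
    found in the generated and the reference question, respectively.
    """
    precision = 0.0
    recall = 0.0
    for category, weight in weights.items():
        h = hyp.get(category, [])
        r = ref.get(category, [])
        shared = overlap_count(h, r)
        precision += weight * (shared / len(h) if h else 0.0)
        recall += weight * (shared / len(r) if r else 0.0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def blended_metric(answerability, bleu, delta):
    """Interpolate the answerability score with a BLEU-style n-gram score."""
    return delta * answerability + (1 - delta) * bleu

# Hypothetical equal weights; in practice these would be tuned against human judgments.
WEIGHTS = {
    "question_type": 0.25,
    "named_entities": 0.25,
    "content_words": 0.25,
    "function_words": 0.25,
}

reference = {
    "question_type": ["where"],
    "named_entities": ["curie"],
    "content_words": ["born"],
    "function_words": ["was"],
}
# A generated question that drops the named entity loses both precision and recall.
generated = dict(reference, named_entities=[])
score = answerability_score(generated, reference, WEIGHTS)
```

Dropping the named entity lowers the score even though three of the four categories match exactly, which is the behavior a pure n-gram metric tends to miss.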
2. Task formulations across domains
Recent work treats answerability as a first-class prediction problem in multiple modalities and system architectures. The shared structure is a query, a bounded context, and a decision about whether the context supports answering, but the label spaces differ substantially.
| Domain | Context | Output space |
|---|---|---|
| Retrieval-augmented ODQA | Question + retrieved excerpt | Answer text or unanswerable |
| VizWiz | Image + spoken question | Answerable, Unsuitable Image, or Unanswerable |
| YTCommentQA | Question + instructional video | Unanswerable, visual, script, or both |
| RaCGEval | Query + retrieved API descriptions | Answerable, partially answerable, unanswerable |
| SCARE | Question + schema + candidate SQL | Preserve/correct SQL, ambiguous, unanswerable |
| DBRouting | Query + repository of databases | Rank databases by answerability |
In VizWiz, answerability is central because images are captured by blind photographers, questions are spoken, and many visual questions cannot be answered. Crowd workers explicitly distinguish “Unsuitable Image” from “Unanswerable”, and the dataset reports that 28.63% of visual questions are not answerable under the criterion that at least half of the 10 answers indicate “unanswerable” or “unsuitable image” (Gurari et al., 2018). In YTCommentQA, naturally generated YouTube questions are labeled not only by answerability but also by required modality: visual, script, or both, with 32.16% of questions answerable by both modalities individually and 7.46% labeled as requiring both visual and script together (Yang et al., 2024).
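The VizWiz labeling criterion quoted above amounts to a vote over crowd answers. The sketch below assumes answers arrive as plain strings and that the two flag values are spelled exactly as shown; the function and constant names are hypothetical.

```python
NOT_ANSWERABLE_FLAGS = {"unanswerable", "unsuitable image"}

def is_not_answerable(crowd_answers, threshold=0.5):
    """True if at least `threshold` of the crowd answers flag the visual question.

    An answer counts as a flag when it is "unanswerable" or "unsuitable image"
    (case-insensitive, surrounding whitespace ignored).
    """
    if not crowd_answers:
        raise ValueError("at least one crowd answer is required")
    flagged = sum(
        1 for answer in crowd_answers
        if answer.strip().lower() in NOT_ANSWERABLE_FLAGS
    )
    return flagged >= threshold * len(crowd_answers)

# Ten answers, five of which flag the question: meets the "at least half" rule.
answers = ["unanswerable"] * 3 + ["unsuitable image"] * 2 + ["a soup can"] * 5
label = is_not_answerable(answers)
```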
In conversational retrieval, answerability is sometimes modeled hierarchically. A CAsT-based benchmark provides labels at the sentence, passage, and ranking levels, with rankings built from all possible 3-passage subsets for a question (Łajewska et al., 2024). In retrieval-augmented code generation, the task becomes three-way—Answerable, Partially answerable, Unanswerable—because multi-part programming requests may be only partially supported by the retrieved API descriptions (Kim et al., 2024). In EHR text-to-SQL verification, SCARE further separates ambiguous from unanswerable and couples classification with SQL preservation or correction, so that ambiguous and unanswerable questions must not yield executable SQL (Lee et al., 13 Nov 2025).
A related extension appears in enterprise data access. DBRouting defines answerability as routing a natural-language question to the database that can provide the correct answer, learning a scoring function over schemas and optional metadata, and evaluating top-k routing with Recall@k or mAP (Mandal et al., 27 Jan 2025). Here answerability is a retrieval problem over data sources rather than a direct response-generation problem.
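Routing-as-ranking and its evaluation can be illustrated with a toy scorer. The token-overlap scoring function below is a placeholder assumption, not the learned scorer used in DBRouting; the database names and schemas are invented for the example, and only the metric definitions (recall at k, average precision) are standard.

```python
def token_overlap_score(question, schema_text):
    """Toy scorer: number of question tokens that also appear in the schema text."""
    q_tokens = set(question.lower().split())
    s_tokens = set(schema_text.lower().replace(",", " ").split())
    return len(q_tokens & s_tokens)

def route(question, schemas, score_fn):
    """Rank database names by descending score for the question."""
    return sorted(schemas, key=lambda name: score_fn(question, schemas[name]), reverse=True)

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant databases found in the top-k of the ranking."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant database appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, name in enumerate(ranked, start=1):
        if name in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical repository of three databases described by column names.
schemas = {
    "hr": "employee, salary, department",
    "sales": "order, customer, invoice",
    "clinic": "patient, visit, diagnosis",
}
ranking = route("average salary per department", schemas, token_overlap_score)
```

Averaging `average_precision` over a set of questions gives mAP; when each question has exactly one correct database, this reduces to mean reciprocal rank.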
3. Why answerability is difficult
A recurrent empirical finding is that models often learn shallow correlates of answerability rather than evidence sufficiency. In retrieval-augmented ODQA, training with random question–passage negatives produces very strong performance on random unanswerable passages but fails catastrophically on semantically overlapping negatives: a T5-XL model achieves 98.0% abstention on random negatives yet only 1.1% on semantically related negatives, including passages related to the question topic or mentioning the correct answer entity in an irrelevant context (Abdumalikov et al., 2024). The proposed explanation combines heuristic learning—“abstain if the passage looks unrelated”—with confirmation bias toward answer entities.
Information-seeking QA exhibits a different difficulty profile. Controlled experiments on Natural Questions and TyDi QA show that once a model is given the gold paragraph and the correct answer type, performance rises sharply; for example, on NQ, ETC + Gold T + Gold P reaches 68.3 short-answer F1, and on TyDi QA mBERT + Gold T reaches 78.5 long-answer F1, only 1.4 F1 behind human on long answers (Asai et al., 2020). The bottlenecks are therefore paragraph selection and answerability prediction, not merely answer extraction. Manual annotation of 800 unanswerable examples across six languages further shows that many cases are retrieval misses, invalid questions, false premises, or invalid answers, confirming that “natural” unanswerability is structurally different from SQuAD 2.0-style synthetic negatives (Asai et al., 2020).
Multimodal settings add modality-specific failure modes. In YTCommentQA, even strong models struggle with segment-level and whole-video answerability classification; GPT-4 obtains only 33.02 F1 on segment answerability and 27.03% on five-way video answerability classification, with 53% of its segment-level errors coming from mistakenly labeling Visual Answerable cases as unanswerable and 85% of Combined Answerable instances misclassified in the video-level task (Yang et al., 2024). In VizWiz, the best simple answerability predictor is the combined Q+I classifier with AP = 0.717 and F1 = 0.648, and the strong contribution of image-only features indicates that answerability is often driven by image-quality failures rather than semantic mismatch (Gurari et al., 2018).
Code and SQL settings show that answerability remains difficult even when the evidence is explicitly structured. In RaCGEval, the best baseline—gemma-1.1-7b-it with fine-tuning—reaches only 46.7% accuracy on the three-way classification task (Kim et al., 2024). In SCARE, systems that aggressively produce or repair SQL achieve strong preservation on answerable questions but weak ambiguity detection; Single-Turn reaches PR = 97.9% yet only 31.6% ambiguous recall, whereas Two-Stage attains the highest unanswerable recall at 93.0% but sacrifices coverage on answerable questions (Lee et al., 13 Nov 2025). This establishes a persistent trade-off between utility and conservative abstention.
4. Methods for detecting unanswerability and controlling response behavior
One line of work improves supervision. In retrieval-augmented ODQA, adding SQuAD 2.0 unanswerable pairs to factual Natural Questions training data largely eliminates the random-negative generalization failure: performance rises from 1.1% abstention on semantically related negatives to 99.9% on “Related to Q” and 100.0% on “Related to A,” while maintaining 98.3% on random negatives (Abdumalikov et al., 2024). In conversational information seeking, a sentence-level BERT classifier with passage- and ranking-level aggregation provides a strong baseline, with the best CAsT-only configuration using max at passage level and mean at ranking level to reach 0.891 ranking accuracy (Łajewska et al., 2024).
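The max-then-mean aggregation reported for the conversational baseline can be sketched as follows. The sentence scores are assumed to come from some sentence-level classifier (not shown), and the 0.5 decision threshold and function names are assumptions for illustration.

```python
from statistics import mean

def passage_score(sentence_scores):
    """Passage-level answerability: the maximum sentence-level score."""
    return max(sentence_scores)

def ranking_score(passages):
    """Ranking-level answerability: the mean of the passage-level scores."""
    return mean([passage_score(sentences) for sentences in passages])

def is_answerable(passages, threshold=0.5):
    """Binary decision for a ranking of passages; the threshold is hypothetical."""
    return ranking_score(passages) >= threshold

# Three retrieved passages, each a list of per-sentence classifier probabilities.
ranking = [
    [0.1, 0.9],   # one sentence strongly supports an answer
    [0.2, 0.3],
    [0.05, 0.1],
]
decision = is_answerable(ranking)
```

Taking the maximum within a passage reflects that a single sentence can carry the answer, while averaging across passages keeps one strong passage from dominating a ranking that is otherwise unsupportive.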
Another line of work modifies the model architecture. Contextual Candor introduces Reinforced Unanswerability Learning (RUL), which combines a discriminative unanswerability head, hierarchical attention-based aggregation across sentence, paragraph, and ranking levels, supervised fine-tuning, and RLHF on refusal responses (Robinson et al., 1 Jun 2025). On the reported benchmark, RUL improves unanswerability detection to 0.840 sentence accuracy, 0.945 paragraph accuracy, and 0.910 ranking accuracy, while also increasing refusal rate on unanswerable queries to 0.920 and human-rated helpfulness/appropriateness to 4.6 on a 5-point scale (Robinson et al., 1 Jun 2025). In video, alignment-oriented approaches pursue a related objective. UVQA constructs unanswerable video questions by altering object, attribute, and relation descriptions; answerability alignment with SFT or DPO then raises Video-LLaVA from F1 0.00 to 0.68 and improves alignment score from 0.33 to 0.68 (Yoon et al., 7 Jul 2025).
A third line of work surfaces answerability signals already latent in LLMs. Hidden-state analysis shows that the final-layer representation of the first generated token linearly separates answerable from unanswerable cases across SQuAD 2.0, NQ, and MuSiQue, with probe F1 reaching 90.4 for Flan-UL2 on SQuAD (Slobodkin et al., 2023). Prompting with an explicit hint such as “If it cannot be answered based on the passage, reply ‘unanswerable’” can improve unanswerability detection dramatically; for Flan-UL2 on SQuAD, zero-shot F1 rises from 46.3 to 92.3 (Slobodkin et al., 2023). Beam relaxation further shows that abstaining responses are often present somewhere in the beam even when the top beam hallucinates, and erasing the answerability subspace sharply reduces both unanswerability F1 and QA performance, supporting a causal role for the latent signal (Slobodkin et al., 2023).
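The prompt-hint manipulation is easy to reproduce in outline. In the sketch below only the hint sentence is taken from the text above; the prompt layout, the helper names, and the exact-match abstention check are assumptions, and the call to an actual model is left out.

```python
HINT = "If it cannot be answered based on the passage, reply 'unanswerable'."

def build_prompt(passage, question, with_hint=True):
    """Assemble a reading-comprehension prompt, optionally with the abstention hint."""
    parts = [f"Passage: {passage}", f"Question: {question}"]
    if with_hint:
        parts.append(HINT)
    parts.append("Answer:")
    return "\n".join(parts)

def is_abstention(response):
    """True if the model output is the abstention keyword (case and final period ignored)."""
    return response.strip().lower().rstrip(".") == "unanswerable"

prompt = build_prompt(
    passage="The museum opened in 1902 and was renovated in 1988.",
    question="Who designed the museum?",
)
```

Comparing `is_abstention` decisions against gold labels with and without the hint is one way to measure how much of the latent answerability signal a prompt can surface.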
Retriever design also matters. Evidentiality-Aware Dense Passage Retrieval constructs synthetic distractors by removing evidence spans from gold passages and trains the retriever to rank evidence above distractors and distractors above unrelated negatives. This improves Top-1 retrieval on Natural Questions from 31.8 to 35.4 under vanilla training and yields downstream exact-match gains with DPR and FiD readers (Song et al., 2023). The central claim is that relevance supervision alone is insufficient for abstractive ODQA because the retriever must learn evidence-bearing passages rather than merely topically relevant passages.
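The two ingredients described above, distractor construction and a three-level ordering, can be sketched as follows. The pairwise margin loss is a simple stand-in chosen for illustration and is not claimed to be the objective used in the cited work; the function names, the margin value, and the example passage are assumptions.

```python
def make_distractor(gold_passage, evidence_span):
    """Build a synthetic distractor by deleting the evidence span from a gold passage."""
    if evidence_span not in gold_passage:
        raise ValueError("evidence span not found in the gold passage")
    # Remove the span, then normalize the whitespace left behind.
    return " ".join(gold_passage.replace(evidence_span, " ").split())

def hinge(higher, lower, margin):
    """Penalty incurred when `higher` does not exceed `lower` by at least `margin`."""
    return max(0.0, margin - (higher - lower))

def evidentiality_ranking_loss(s_evidence, s_distractor, s_negative, margin=1.0):
    """Encourage the ordering: evidence > distractor > unrelated negative."""
    return hinge(s_evidence, s_distractor, margin) + hinge(s_distractor, s_negative, margin)

gold = "Marie Curie was a physicist. She was born in Warsaw. She won two Nobel Prizes."
distractor = make_distractor(gold, "She was born in Warsaw.")
# The distractor stays on-topic but no longer supports "Where was Marie Curie born?".
```

The distractor is the interesting case: it remains topically relevant, so a retriever trained only on relevance has no reason to rank it below the gold passage.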
5. Evaluation regimes and representative findings
Evaluation protocols vary because answerability is task-dependent. ODQA studies often measure the abstention rate, that is, the percentage of unanswerable cases on which the model correctly abstains, supplemented in hard settings by the proportions of extracted, hallucinated, and abstained responses (Abdumalikov et al., 2024). Conversational retrieval work reports classification accuracy at sentence, passage, and ranking levels (Łajewska et al., 2024). VizWiz uses precision-recall curves, average precision, and F1 for predicting whether a visual question is not answerable (Gurari et al., 2018). YTCommentQA separates binary segment answerability, evaluated with F1, from five-way video answerability and modality classification, evaluated with accuracy (Yang et al., 2024). RaCGEval evaluates three-way classification accuracy and studies downstream code-generation effects with pass@k and coverage (Kim et al., 2024). DBRouting reports Recall@1, Recall@3, and mAP over databases (Mandal et al., 27 Jan 2025). SCARE uses Preservation Rate, Coverage, Correction Rate, and label precision/recall/F1 for ambiguous and unanswerable classes (Lee et al., 13 Nov 2025).
The empirical record is uniform in one respect: answerability is rarely solved by direct transfer from adjacent tasks. In VizWiz, a prior relevance model reaches only AP = 0.306, whereas training or fine-tuning on VizWiz raises performance to AP = 0.561 and 0.605, and multimodal Q+I remains clearly best at 0.717 (Gurari et al., 2018). In CAsT-based conversational search, a trained classifier-and-aggregation pipeline outperforms ChatGPT at the ranking level: the best CAsT-only model reaches 0.891 ranking accuracy, versus 0.669 for zero-shot and 0.601 for two-shot ChatGPT (Łajewska et al., 2024). In DBRouting, Llama3 70B performs best on Spider-Route, but token-length limitations prevent analogous ranking on Bird-Route, while task-specific SBERT embeddings substantially improve over pre-trained embeddings and approach LLM performance in some settings (Mandal et al., 27 Jan 2025).
Human studies expose an additional asymmetry between factual support and perceived quality. In conversational information seeking, controlled manipulations of factual correctness and source presence/validity have no statistically significant effect on user ratings in the answerability study, whereas diversity manipulations in the viewpoints study significantly affect diversity, transparency, balance, and overall satisfaction (Łajewska et al., 2024). The same study reports Pearson correlations with overall satisfaction of 0.634 for factual correctness, 0.660 for confidence in answer accuracy, 0.720 for diversity, 0.727 for transparency, and 0.785 for balance, leading to the conclusion that response incompleteness is easier for users to detect than answerability failures and that user satisfaction is mostly associated with response diversity rather than factual correctness (Łajewska et al., 2024).
6. Open issues and research directions
A central unresolved issue is granularity. Several papers explicitly note that binary answerable/unanswerable labels are simplifications. Conversational search work suggests a future ordinal formulation such as unanswerable / partially answerable / fully answerable (Łajewska et al., 2024). Retrieval-augmented code generation already operationalizes partially answerable as a separate class for multi-part requests (Kim et al., 2024). SCARE distinguishes ambiguous from unanswerable, and its error analysis shows that vague questions, vague words, and ambiguous references are much harder to detect than obvious small-talk or missing-column cases (Lee et al., 13 Nov 2025). These results suggest that answerability is often entangled with clarification, decomposition, and scope management rather than mere refusal.
Another open issue concerns evidence composition. YTCommentQA identifies questions answerable only when visual and script information are combined (Yang et al., 2024). Financial IR benchmark construction imposes a multi-document dependency criterion, requiring that a candidate multi-document query be more answerable with the combined passages than with any individual document alone (Kim et al., 7 Nov 2025). In long-form QA, questions generated from abstractive summaries are harder than passage-generated questions: QG-Summary requires multiple passes through the passage in 31% of cases versus 24.4% for QG-Passage, and open-source models degrade especially for contexts longer than 1024 tokens (Bhat et al., 2023). This suggests that answerability is increasingly a property of distributed evidence and long-context reasoning.
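The multi-document dependency criterion can be written as a simple comparison, given any answerability scorer. The keyword-coverage scorer below is a toy assumption used only to make the example runnable; a real benchmark would use a model-based or annotated score, and all names here are hypothetical.

```python
def requires_multiple_documents(query, documents, answerability):
    """True if the combined documents are more answerable than any single one.

    `answerability(query, docs)` is any scorer returning a number where
    higher means the documents support the query better.
    """
    combined = answerability(query, documents)
    best_single = max(answerability(query, [doc]) for doc in documents)
    return combined > best_single

def keyword_coverage(required_keywords):
    """Toy scorer factory: fraction of required keywords present in the documents."""
    def scorer(query, docs):
        text = " ".join(docs).lower()
        found = sum(1 for keyword in required_keywords if keyword in text)
        return found / len(required_keywords)
    return scorer

scorer = keyword_coverage(["revenue", "operating costs"])
query = "What was the operating margin in 2023?"
split_docs = [
    "Revenue in 2023 was 10M.",
    "Operating costs in 2023 were 4M.",
]
single_doc = ["Revenue in 2023 was 10M and operating costs were 4M.", "The CEO changed in 2022."]
```

With `split_docs` the needed facts are spread over two passages, so the criterion holds; with `single_doc` one passage already covers everything and the query would be rejected as a multi-document candidate.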
A further direction is interactive recovery from unanswerability. Agentic RAG research proposes suggesting nearby answerable queries rather than merely rejecting the original one. It distinguishes no workflow, no knowledge, and answerable queries, templates unanswerable questions to workflow-level forms, retrieves answerability-labeled examples with robust dynamic few-shot learning, and reports improved similarity and answerability over static few-shot and retrieval-only baselines on three real-world datasets (Spaeh et al., 13 Jan 2026). Such work reframes answerability from a binary safety filter into a mechanism for completing user interaction.
Taken together, the literature defines the Answerability Challenge as a general problem of evidence-bounded inference: systems must know whether the current context supports an answer, separate support from relevance, refuse or clarify when necessary, and do so across text, images, video, code, databases, and multimodal workflows. The dominant empirical lesson is consistent across domains: when answerability is ignored, systems hallucinate; when it is modeled explicitly, both reliability and interpretability improve (Abdumalikov et al., 2024, Łajewska et al., 2024, Yoon et al., 7 Jul 2025).