LLM-based Answerability
- LLM-based answerability is defined as the capability of a model to provide verifiable, context-grounded answers without relying on external knowledge or unsupported inference.
- Reasoning-augmented scoring and hierarchical aggregation methods are used to assess and validate answer quality in both single and multi-document scenarios.
- Empirical benchmarks and domain-specific protocols help reduce false positives and align model responses with human judgment.
LLM-based answerability refers to a suite of formal, algorithmic, and evaluation methodologies designed to determine the precise circumstances under which an LLM can or should return a reliable, context-grounded answer to a given query. This notion is strictly distinguished from related IR or QA metrics such as "relevance," "retrievability," or "confidence," and typically incorporates structural constraints, reasoning enhancements, and explicit abstention mechanisms to ensure alignment with human judgment and domain-specific requirements.
1. Formal Definitions: Answerability vs. Related Notions
Answerability is the property that a question can be fully and verifiably answered using a specified context, with no reliance on parametric model knowledge, external facts, or unsupported inference. In the context of information retrieval and domain-specific benchmarks, especially in financial, legal, and scientific domains, answerability is operationalized as follows (Kim et al., 7 Nov 2025, Kim et al., 2024):
- Relevance: Subject-matter overlap between query and document.
- Retrievability: An IR model can fetch documents bearing on the query.
- Answerability: The context contains all information needed to answer the query, with no external knowledge or unsupported reasoning required. In multi-document settings, answerability is valid only if reasoning over the union of all provided contexts is necessary; no single document alone suffices.
Formally, for a query $q$ and context set $C = \{d_1, \dots, d_n\}$, a reasoning-augmented answerability function $f_{\text{ans}} \in [0, 1]$ assigns a score, and a threshold $\tau$ enforces strict acceptance criteria (Kim et al., 7 Nov 2025):
- Single-document: accept $(q, d_i)$ if $f_{\text{ans}}(q, \{d_i\}) \ge \tau$.
- Multi-document: accept $(q, C)$ if $f_{\text{ans}}(q, C) \ge \tau$ and $f_{\text{ans}}(q, \{d_i\}) < \tau$ for every $d_i \in C$, so that reasoning over the union of contexts is genuinely required.
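A minimal sketch of these acceptance criteria, assuming a generic scorer `score(query, docs)` that returns a value in $[0, 1]$; the function names and the threshold value are illustrative, not those of Kim et al.:

```python
from typing import Callable, Sequence

def is_answerable_single(score: Callable[[str, Sequence[str]], float],
                         query: str, doc: str, tau: float = 0.9) -> bool:
    """Single-document answerability: the document alone must clear the threshold."""
    return score(query, [doc]) >= tau

def is_answerable_multi(score: Callable[[str, Sequence[str]], float],
                        query: str, docs: Sequence[str], tau: float = 0.9) -> bool:
    """Multi-document answerability: the union of documents clears the threshold,
    but no individual document does (cross-document reasoning is required)."""
    union_ok = score(query, list(docs)) >= tau
    no_single_suffices = all(score(query, [d]) < tau for d in docs)
    return union_ok and no_single_suffices
```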
2. LLM-based Answerability Assessment Protocols
2.1 Reasoning-Augmented Scoring
In high-stakes or complex domains, black-box similarity measures are insufficient. Instead, reasoning-augmented LLMs are employed to issue both a chain-of-thought analysis and an explicit answerability score. The DeepSeek-14B ThinkEval model is a key example, using a prompt that requires explicit "Think" steps (identifying the request, localizing the supporting facts, and judging their sufficiency) before outputting a numeric score (Kim et al., 7 Nov 2025).
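A hedged sketch of such a reasoning-augmented scoring call, assuming a generic `complete(prompt) -> str` wrapper around the evaluator LLM; the prompt wording and score format below are illustrative, not the exact ThinkEval template:

```python
import re
from typing import Callable, Sequence

THINK_PROMPT = """You are an answerability judge.
Think step by step:
1. Identify what the query is asking.
2. Locate the facts in the context that bear on the query.
3. Judge whether those facts suffice to answer fully, without outside knowledge.
Then output a line of the form "SCORE: x" with x between 0 and 1.

Query: {query}
Context:
{context}
"""

def answerability_score(complete: Callable[[str], str],
                        query: str, docs: Sequence[str]) -> float:
    """Run the Think-style prompt and parse the final numeric score."""
    prompt = THINK_PROMPT.format(query=query, context="\n---\n".join(docs))
    output = complete(prompt)
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", output)
    return float(match.group(1)) if match else 0.0
```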
Such protocols yield:
- Higher correlation with human judgment (Pearson up to 0.78 and Kendall up to 0.57).
- Low false-positive rates for strictly answerable queries.
- Support for conditional acceptance criteria in multi-context evaluation.
2.2 Aggregation and Hierarchical Labeling
Detection of answerability is boosted by multi-level aggregation (sentence, paragraph, ranking). Classifiers in (Łajewska et al., 2024) and hierarchical attention heads in (Robinson et al., 1 Jun 2025) operate at all levels, increasing sensitivity to partial vs. complete answer scenarios.
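A minimal sketch of such multi-level aggregation, assuming max-pooling from sentence to paragraph to ranking level (the aggregation operator is an assumption; the cited classifiers may combine levels differently):

```python
from typing import Dict, List

def aggregate_hierarchically(sentence_scores: Dict[str, List[float]]) -> float:
    """Aggregate sentence-level answerability scores to paragraph level,
    then to a ranking-level score over all paragraphs (max-pooling at each step)."""
    paragraph_scores = {pid: max(scores) for pid, scores in sentence_scores.items() if scores}
    return max(paragraph_scores.values()) if paragraph_scores else 0.0

# Example: two paragraphs with per-sentence scores
ranking_score = aggregate_hierarchically({"p1": [0.1, 0.4], "p2": [0.7, 0.2]})  # -> 0.7
```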
3. Algorithmic and Mathematical Formulations
3.1 Score Computation and Filtering
| Stage | Mathematical Formulation | Source |
|---|---|---|
| Reasoning-augmented scoring | $s = f_{\text{ans}}(q, C) \in [0, 1]$ | (Kim et al., 7 Nov 2025) |
| Multi-doc dependency | $f_{\text{ans}}(q, C) \ge \tau$ and $\max_{d_i \in C} f_{\text{ans}}(q, \{d_i\}) < \tau$ | (Kim et al., 7 Nov 2025) |
| Thresholding | Accept if $s \ge \tau$ (strict threshold) | (Kim et al., 7 Nov 2025) |
| Hierarchical aggregation | $s_{\text{para}} = \max_i s_{\text{sent},i}$, etc. | (Łajewska et al., 2024) |
| Linear direction projection | $p = \mathbf{h}^\top \mathbf{u}$ (projection of activation $\mathbf{h}$ onto an unanswerability direction $\mathbf{u}$) | (Lavi et al., 26 Sep 2025) |
Hierarchical and projection-based scoring of this kind keeps evaluation tractable while reducing false positives and improving cross-document reasoning alignment.
4. Benchmarking and Evaluation
4.1 Empirical Benchmarks
Answerability evaluation must be anchored in annotated benchmarks:
- Financial IR: KoBankIR (multi-document Korean banking corpus) includes both simple and merged queries, validated rigorously via reasoning-augmented scoring (Kim et al., 7 Nov 2025).
- Conversational QA: the CAsT-Answerability and ECA datasets support hierarchical sentence/paragraph/ranking labels with robust inter-annotator agreement (Robinson et al., 1 Jun 2025, Łajewska et al., 2024).
- Code generation: RaCGEval provides three-way labels (answerable / partially answerable / unanswerable) over retrieval-augmented code-generation queries (Kim et al., 2024).
4.2 Human and Automatic Judgments
Validation requires measuring the correlation between LLM-based answerability scores and human annotations; reported correlations show that prompt-structured, CoT-enhanced LLM evaluators exhibit the strongest alignment with human judgment (Kim et al., 7 Nov 2025, Robinson et al., 1 Jun 2025).
4.3 Quiz-based and Scenario-driven Testing
For generative tasks (survey writing, long-form QA), answerability is tested via quiz-driven evaluations:
- SurveyBench deploys quiz-based win-rates and reference-anchored quality scores to directly probe whether a generated text supports real reader queries (Sun et al., 3 Oct 2025); a minimal win-rate sketch follows this list.
- LFQA settings rate models on coherence, relevance, factual consistency, and accuracy, and distinguish models by their ability to respond to summary-derived (harder) queries (Bhat et al., 2023).
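A minimal sketch of a quiz-based win-rate computation, assuming per-question correctness judgments for a candidate text and a reference; the exact win-rate definition and tie handling in SurveyBench may differ:

```python
from typing import Sequence

def quiz_win_rate(candidate_correct: Sequence[bool],
                  reference_correct: Sequence[bool]) -> float:
    """Fraction of quiz questions on which the candidate is judged correct
    while the reference is not (ties contribute nothing to the numerator)."""
    assert len(candidate_correct) == len(reference_correct)
    wins = sum(c and not r for c, r in zip(candidate_correct, reference_correct))
    return wins / len(candidate_correct) if candidate_correct else 0.0

# Example: candidate wins 2 of 5 reader-quiz questions outright
print(quiz_win_rate([True, True, False, True, False],
                    [False, False, False, True, True]))  # 0.4
```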
5. Failure Modes, Limits, and Abstention
5.1 LLM Instability and Answerability
Empirical studies on legal QA reveal that state-of-the-art LLMs remain fundamentally unstable (answer "flipping" in 50% of hard cases even at temperature 0), especially when legal standards are open-ended or fact-intensive (Blair-Stanek et al., 28 Jan 2025). Stability is measured as the proportion of repeated responses that match the modal answer; high instability rates signal intrinsic limitations of LLM answerability in domains with high ambiguity or insufficiently constrained queries.
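A minimal sketch of this stability measure over repeated samples of the same query (function and variable names are ours):

```python
from collections import Counter
from typing import Sequence

def stability(answers: Sequence[str]) -> float:
    """Proportion of repeated responses that coincide with the modal answer.
    Values near 1.0 indicate stable behavior; values near 1/k (for k answer
    options) indicate near-random flipping."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers) if answers else 0.0

# Example: 10 repeated runs of the same legal query
print(stability(["yes"] * 6 + ["no"] * 4))  # 0.6 -> unstable
```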
5.2 Abstention and Unanswerability Detection
Modern work proposes explicit "abstention ability" as an answerability safeguard: LLMs should withhold a response ("I don't know") when the context is insufficient (Madhusudhan et al., 2024). Black-box evaluation using the Answerable–Unanswerable Confusion Matrix formalizes abstention outcomes (true/false positives and negatives), with chain-of-thought prompting markedly increasing correct abstention rates.
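A hedged sketch of how abstention outcomes can be tallied in such a confusion matrix, treating "abstain on an unanswerable question" as the positive event (the exact metric definitions in Madhusudhan et al. may differ):

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class AbstentionCounts:
    tp: int = 0  # abstained on an unanswerable question (correct abstention)
    fn: int = 0  # answered an unanswerable question (missed abstention)
    fp: int = 0  # abstained on an answerable question (over-abstention)
    tn: int = 0  # answered an answerable question (correct answer attempt)

def tally(abstained: Sequence[bool], unanswerable: Sequence[bool]) -> AbstentionCounts:
    c = AbstentionCounts()
    for a, u in zip(abstained, unanswerable):
        if u and a:
            c.tp += 1
        elif u and not a:
            c.fn += 1
        elif not u and a:
            c.fp += 1
        else:
            c.tn += 1
    return c

def abstention_recall(c: AbstentionCounts) -> float:
    """Correct-abstention rate on unanswerable questions."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
```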
Linear projection techniques in activation space further enable robust, dataset-agnostic unanswerability detection and steerable refusal behavior (Lavi et al., 26 Sep 2025).
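A minimal sketch of projection onto a linear "unanswerability direction" in activation space, with the direction estimated as a difference of class means (a common construction; the procedure in Lavi et al. may differ):

```python
import numpy as np

def unanswerability_direction(h_unans: np.ndarray, h_ans: np.ndarray) -> np.ndarray:
    """Estimate a direction from hidden states of unanswerable vs. answerable prompts.
    h_unans, h_ans: arrays of shape (n_examples, hidden_dim)."""
    u = h_unans.mean(axis=0) - h_ans.mean(axis=0)
    return u / np.linalg.norm(u)

def unanswerability_score(h: np.ndarray, u: np.ndarray) -> float:
    """Project a single hidden state onto the direction; larger means more unanswerable."""
    return float(h @ u)

# Thresholding the projection yields a dataset-agnostic detector; shifting
# activations along -u is one way to steer refusal behavior.
```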
6. Architectural and Practical Considerations
6.1 Reasoning-Enhanced Models
Incorporating explicit chain-of-thought steps and multi-level aggregation enables richer and more faithful detection of answerable/unanswerable queries, especially in conversational and document-level QA settings (Robinson et al., 1 Jun 2025, Łajewska et al., 2024).
6.2 Multi-agent and Symbolic Workflows
Legal AI systems such as L4M integrate dual LLM agents with SMT solvers to ensure that only queries with a formally satisfiable extraction of facts, statutes, and logical relationships are deemed answerable (Chen et al., 26 Nov 2025). Similar principles apply to multi-agent retrieval reasoners in legal QA (Wang et al., 31 Aug 2025) and embedding-based answerability graphs in opinion mining (Fukuma et al., 2024).
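A minimal sketch of an SMT-backed answerability gate, using the z3-solver Python bindings: extracted facts and a statutory rule are encoded as Boolean constraints, and the query is treated as answerable only if the combined encoding is satisfiable (the encoding and predicate names are purely illustrative, not those of L4M):

```python
from z3 import Bool, Implies, Solver, sat  # pip install z3-solver

# Hypothetical extracted facts and a statutory rule, encoded as Booleans.
signed_contract = Bool("signed_contract")
consideration = Bool("consideration")
enforceable = Bool("enforceable")

solver = Solver()
solver.add(Implies(signed_contract, consideration))  # extracted factual relationship
solver.add(Implies(consideration, enforceable))      # statutory rule
solver.add(signed_contract)                          # asserted fact
solver.add(enforceable)                              # claim implied by the query

# Deem the query answerable only if facts + statutes admit a consistent model.
answerable = solver.check() == sat
print(answerable)
```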
6.3 Domain and Modality Sensitivity
Answerability scoring and detection require domain-specific calibration, especially in legal, medical, scientific, and code domains. Transfer of techniques (e.g., reasoning-augmented LLM evaluators, chain-prompt designs) across domains demands prompt engineering, retraining, and possibly new aggregation templates (Kim et al., 7 Nov 2025, Kim et al., 2024).
7. Open Challenges and Future Directions
Limitations persist in:
- Generalization to multi-modal contexts (e.g., tables, PDFs).
- Scaling to low-resource or cross-lingual scenarios.
- Fully bridging the human–machine gap, as seen in quiz-based answerability (e.g., LLM surveys underperform humans by 21% on average (Sun et al., 3 Oct 2025)).
- Achieving reliable, robust abstention and uncertainty calibration in black-box LLM settings.
Ongoing work targets development of more sophisticated prompt-based methods, reinforcement learning with answerability-oriented reward, and improved dataset construction for fine-grained answerability annotation and hierarchical validation.
References
- Reasoning-augmented answerability assessment and benchmark construction: (Kim et al., 7 Nov 2025)
- LLM instability and legal answerability metrics: (Blair-Stanek et al., 28 Jan 2025)
- Answerability in retrieval-augmented code generation: (Kim et al., 2024)
- Hierarchical unanswerability detection and trustworthiness: (Robinson et al., 1 Jun 2025)
- Sentence-level and passage-level answerability in conversational QA: (Łajewska et al., 2024)
- Answerability in long-form QA: (Bhat et al., 2023)
- Multi-agent, SMT-backed answerability in legal reasoning: (Chen et al., 26 Nov 2025)
- Quiz-based answerability for survey generation: (Sun et al., 3 Oct 2025)
- Linear direction methods for unanswerability detection: (Lavi et al., 26 Sep 2025)
- Prompt engineering and answerability for legal compliance: (Hannah et al., 2024)
- Abstention ability and confusion-matrix driven assessment: (Madhusudhan et al., 2024)
- Logical-structure and semantic integration models for legal QA: (Yao et al., 11 Feb 2025)
- QA graph-based answerability models for opinion mining: (Fukuma et al., 2024)