Question Answerability Classification (AC)
- Question Answerability Classification is the task of determining whether a question is answerable from given evidence, typically framed as binary or multi-class prediction.
- Approaches range from classical feature-based models and similarity methods to neural transformers and LLM activation-space analyses, yielding notable accuracy and generalization improvements.
- Applications span Q&A platform moderation, expert routing, knowledge base evaluation, and LLM hallucination control in diverse multimodal and conversational settings.
Question Answerability Classification (AC) is a task that formalizes the prediction of whether a given question, within the context of a particular system or evidence set, is likely to be (fully) answerable. Variants arise across community Q&A, conversational QA, information-seeking dialogue, document analysis, open-domain, and multimodal video settings. While problem statements, datasets, and forms of context differ, the core is binary (or multi-class) supervised classification: given a question q and evidence e (which may be a passage, corpus, video, or even just the question’s linguistic content), predict a label, answerable or unanswerable. AC underpins moderation, question routing, data quality assessment, evaluation of knowledge bases, and hallucination avoidance in LLMs.
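As a concrete rendering of this shared formulation, the minimal sketch below defines the input/output contract of an AC system. The names (`ACExample`, `AnswerabilityClassifier`) are illustrative only and do not come from any of the cited works.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class ACExample:
    """A single answerability instance: question q plus evidence e."""
    question: str
    evidence: str                 # passage, transcript, corpus snippet, ...
    label: Optional[int] = None   # 1 = answerable, 0 = unanswerable


class AnswerabilityClassifier(ABC):
    """Common interface: score the pair (q, e), then threshold."""

    @abstractmethod
    def score(self, example: ACExample) -> float:
        """Return P(answerable | q, e) or any monotone answerability score."""

    def predict(self, example: ACExample, threshold: float = 0.5) -> int:
        return int(self.score(example) >= threshold)
```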
1. Formal Definitions and Problem Formulations
The mathematical formulation of AC is context-dependent:
- Community Q&A: Given a question $q$ and an observation window $\Delta t$, predict $y \in \{0, 1\}$, where $y = 1$ if $q$ receives at least one answer within $\Delta t$ (Quora) (Maity et al., 2017).
- Conversational QA: Given a pair $(q, s)$ of question and context sentence, predict $y = 1$ iff $q$ is answerable from $s$ (CoQA/QNLI sentence-level), extended hierarchically to passage and conversation histories (Hwang et al., 2022).
- Information-Seeking Dialogue: A classifier predicts at the sentence level whether a candidate snippet contains (part of) the answer; max/mean aggregation then produces passage-level and ranking-level scores (Łajewska et al., 21 Jan 2024).
- LLMs: For LLMs, AC reduces to deciding whether the given context $c$ supports an answer to $q$, approximated via activation-level scoring (Lavi et al., 26 Sep 2025): with hidden state $h$ at a chosen layer and an unanswerability direction $d$, define the score $s(q, c) = \langle h, d \rangle$ and predict unanswerable iff $s(q, c) > \tau$, for a threshold $\tau$ learned on dev data (see the sketch below).
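A minimal sketch of such activation-level scoring, assuming a Hugging Face decoder-only model. The difference-of-means direction used here is a simple stand-in for the steering-based direction selection of Lavi et al. (26 Sep 2025); the model name, layer index, and prompt format are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder decoder-only LM
LAYER = 16                               # probed layer index (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def hidden_state(question: str, context: str) -> torch.Tensor:
    """Hidden state h of the final prompt token at the probed layer."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]          # shape: (hidden_dim,)


def fit_direction(answerable, unanswerable) -> torch.Tensor:
    """Difference-of-means unanswerability direction from labelled dev pairs (q, c)."""
    mu_a = torch.stack([hidden_state(q, c) for q, c in answerable]).mean(0)
    mu_u = torch.stack([hidden_state(q, c) for q, c in unanswerable]).mean(0)
    d = mu_u - mu_a
    return d / d.norm()


def predict_unanswerable(question: str, context: str, d: torch.Tensor, tau: float) -> bool:
    """Predict 'unanswerable' iff the projection <h, d> exceeds threshold tau."""
    return float(hidden_state(question, context) @ d) > tau
```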
A table summarizing representative AC task settings:
| Domain | Input | Context Used | Label | Notable Formulation |
|---|---|---|---|---|
| Community Q&A | question | linguistic, meta | Answered / Open | SVM on Quora (Maity et al., 2017) |
| Info-Seeking Search | question | text corpus | Answerable / Unanswerable | BERT, passage–sentence aggregation (Łajewska et al., 21 Jan 2024) |
| LLMs (Extractive QA) | question | passage | Answerable / Unanswerable | Directional activation scoring (Lavi et al., 26 Sep 2025) |
| Video QA | question | video, script | Multiclass (5-way) | Llama2/SeViLA, timestamp anchoring (Yang et al., 30 Jan 2024) |
| QA-based SCR | question | document | Answerable / Unanswerable | LLM-prompted pipeline (Aperstein et al., 10 Sep 2025) |
2. Methodologies: Modeling, Features, Architectures
Approaches to AC span classical classifiers, deep neural networks, LLM prompts, and activation-space analysis:
- Feature-Based Supervised Learning: Early work, e.g. on Quora (Maity et al., 2017), utilizes SVMs and engineered feature vectors. Features include:
- Surface (length, OOV rate, $n$-gram presence)
- Syntax (POS tag diversity)
- Topic modeling (LDA topic membership & diversity)
- Edit/readability (ROUGE-LCS recall from original to edited question)
- Psycholinguistics (LIWC category fractions)
- Similarity/Nearest-Neighbor Models: For detection of unclear or unanswerable questions, retrieval-based features from similar historical questions are used. Key features include BM25 similarity sums, code block presence, length, and cosine similarity between clarification keyphrases (Trienes et al., 2019).
- Neural Architectures:
- Contextual Transformers: Sentence-level and passage-level AC via fine-tuned BERT/ALBERT encoders, using joint question–context inputs and a classification head (Hwang et al., 2022, Łajewska et al., 21 Jan 2024); see the sketch after this list.
- Multimodal Fusion: Video QA employs vision encoders (ViT-based or SeViLA) fused with transcript encoders; LLMs are augmented for 16k context tokens via rotary embeddings and FlashAttention (Yang et al., 30 Jan 2024).
- LLM Black-Box Evaluation: Zero-shot or prompt-based answerability verdicts (e.g., does the model extract an answer, or does a “JUDGE” LLM assert semantic equivalence) (Wang et al., 2023, Aperstein et al., 10 Sep 2025).
- Activation-Space Methods (LLMs): Identification of a linear direction $d$ in activation space whose projection $\langle h, d \rangle$ separates answerable from unanswerable inputs, with candidate directions ranked by their ability to steer model output toward abstention (measured via the log-odds of an “unanswerable” token) (Lavi et al., 26 Sep 2025).
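The sentence-level transformer formulation and its max/mean aggregation can be sketched as follows, assuming Hugging Face `transformers` and a binary classification head; the checkpoint name is a placeholder standing in for a fine-tuned AC model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "bert-base-uncased"  # placeholder; a fine-tuned AC checkpoint in practice

tok = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()


@torch.no_grad()
def sentence_scores(question: str, sentences: list[str]) -> torch.Tensor:
    """P(answerable) for each joint (question, sentence) input pair."""
    batch = tok([question] * len(sentences), sentences,
                padding=True, truncation=True, return_tensors="pt")
    logits = model(**batch).logits                      # shape: (n_sents, 2)
    return torch.softmax(logits, dim=-1)[:, 1]          # prob of class "answerable"


def passage_score(question: str, sentences: list[str], agg: str = "max") -> float:
    """Aggregate sentence-level probabilities to a passage-level verdict."""
    scores = sentence_scores(question, sentences)
    return float(scores.max() if agg == "max" else scores.mean())


# Usage:
# passage_score("Who wrote Dune?",
#               ["Dune is a 1965 novel.", "It was written by Frank Herbert."])
```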
3. Datasets and Labeling Strategies
Datasets underpinning AC research are characterized by their granularity, label rigor, and connection to downstream tasks:
Community Q&A
- Quora: 822,040 questions, 1.8M answers, labels based on time-to-first-answer (1 or 3 months) (Maity et al., 2017).
- Stack Exchange: Multiple sites (SO, SuperUser, AskUbuntu, etc.) with clear/unclear labels inferred by comment and edit heuristics (Trienes et al., 2019).
Information-Seeking and Conversational Search
- CAsT-Answerability: Sentence, passage, and ranking-level annotations derived from TREC CAsT, based on nugget overlap, passage inclusion, and combinatorial ranking (Łajewska et al., 21 Jan 2024).
- CoQA/QNLI-based: AC modules trained on sentence-level labeled Q–A pairs, incorporating both single-turn and multi-turn conversational data (Hwang et al., 2022).
LLM-Based and Document-Level
- Synthetic SQuAD Variants: For semantic coverage relation (SCR) analysis, answerability is controlled by surgical information removal across paraphrased documents, yielding pairs with guaranteed label correctness (Aperstein et al., 10 Sep 2025).
- RepLiQA, NQ, MuSiQue, SQuAD2.0: Balanced sets of answerable/unanswerable pairs for LLM activation-analysis studies (Lavi et al., 26 Sep 2025); a labeling sketch appears at the end of this section.
Multimodal/Video QA
- YTCommentQA: 2,332 user-generated questions across 2,004 YouTube videos, labeled by human annotators after timestamp verification, script/visual examination, and consensus (Yang et al., 30 Jan 2024).
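To make the labeling conventions concrete, the sketch below derives binary answerability labels from SQuAD 2.0 (empty gold-answer list ⇒ unanswerable) using the Hugging Face `datasets` library; the balancing step is a simplification, not the exact protocol of the cited studies.

```python
import random
from datasets import load_dataset

squad = load_dataset("squad_v2", split="train")


def to_ac_example(row):
    """SQuAD 2.0 marks unanswerable questions with an empty gold-answer list."""
    return {
        "question": row["question"],
        "evidence": row["context"],
        "label": int(len(row["answers"]["text"]) > 0),   # 1 = answerable
    }


examples = [to_ac_example(r) for r in squad]
pos = [e for e in examples if e["label"] == 1]
neg = [e for e in examples if e["label"] == 0]

# Simple balanced subsample (the cited studies use their own splits).
k = min(len(pos), len(neg))
balanced = random.sample(pos, k) + random.sample(neg, k)
random.shuffle(balanced)
```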
4. Evaluation Metrics and Results
Performance is measured using standard classification metrics; implementation choices impact interpretability and generalization:
- Community Q&A Models (Maity et al., 2017):
- SVM (1 month): Accuracy = 76.26%, Precision = 0.763, ROC AUC ≈ 0.762.
- Linguistic features alone: Accuracy = 74.18%.
- Unclear/Answerability Entry Points (Trienes et al., 2019, Łajewska et al., 21 Jan 2024):
- F1 (unclear/unanswerable class): CNN/BOW-LogReg ≈ 0.79.
- BERT (sentence, passage, ranking): 0.752, 0.634 (max agg), 0.891 (ranking mean agg).
- LLM Directions and Baselines (Lavi et al., 26 Sep 2025):
- Unanswerable recall (same dataset): classifier ≈ 87%, direction ≈ 83–86%.
- Cross-dataset F1: direction drops only ~7.4%, classifier ~30.2%.
- Synthetic QA Evaluation (Wang et al., 2023, Aperstein et al., 10 Sep 2025):
- PMAN (automatic metric): Accuracy on manually vetted non-yes/no questions: 0.94, F1 (answerable/unanswerable) ≈ 0.91–0.94.
- SCR AC datasets ensure 100% label fidelity by construction.
- Video QA (Yang et al., 30 Jan 2024):
- Binary segment-level F1: Llama-2 (13B): 55.49, SeViLA: 46.55.
- Video-level multiclass accuracy: Llama-2 (13B): 37.70%, SeViLA: 35.27%; basic heuristics perform substantially worse.
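The headline numbers above correspond to standard classification metrics; a minimal scikit-learn sketch (with placeholder predictions, not data from the cited studies) shows how they are computed.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder predictions: y_true / y_score would come from a trained AC model.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]            # 1 = answerable, 0 = unanswerable
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 (unanswerable class):", f1_score(y_true, y_pred, pos_label=0))
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("ROC AUC:", roc_auc_score(y_true, y_score))
```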
A table of select results:
| Setting | Method/Baseline | Top Metric(s) | Value |
|---|---|---|---|
| Quora (1 mo) | SVM (full features) | Accuracy | 76.26% |
| Quora (1 mo) | Linguistic only | Accuracy | 74.18% |
| StackEx (unclear det.) | BoW LR / CNN | F1 (unclear) | ≈0.79 |
| CAsT (ranking AC) | BERT (max/mean agg.) | Accuracy | 0.891 |
| LLM Direction (cross-ds) | Direction (calibrated) | Macro-F1 / Recall | up to 11.9% over classifier |
| YTCommentQA (binary seg) | Llama-2 (13B) | F1 | 55.49 |
5. Feature Importance and Model Analysis
Comprehensive feature ablation and coefficient analyses across multiple works reveal:
- Language Use: LIWC psycholinguistic features, POS tag diversity, and ROUGE-LCS all strongly correlate with answerability (Maity et al., 2017); a coefficient-inspection sketch follows this list.
- Structural/Behavioral Factors: Edit counts, promotion behavior, and topic assignment changes are informative (Maity et al., 2017).
- Similarity-derived Cues: The need for clarifying information among similar questions (BM25+keyphrase statistics) robustly predicts unclear/unanswerable status (Trienes et al., 2019).
- Neural/Activation: Linear directions in LLM activation space allow identification of an “unanswerable” subspace without retraining, capturing abstention cues generalizable across datasets (Lavi et al., 26 Sep 2025).
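A rough sketch of the coefficient-style analysis behind such findings, assuming a linear model over a named feature matrix; logistic regression is used here as a stand-in for the SVMs of the cited work, and the feature names and data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder feature matrix: rows = questions, columns = engineered features.
feature_names = ["length", "oov_rate", "pos_diversity", "rouge_lcs", "liwc_cognitive"]
X = np.random.rand(200, len(feature_names))          # stand-in for real features
y = np.random.randint(0, 2, size=200)                # 1 = answered, 0 = open

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by absolute coefficient magnitude (a rough importance proxy).
ranked = sorted(zip(feature_names, clf.coef_[0]), key=lambda t: abs(t[1]), reverse=True)
for name, w in ranked:
    print(f"{name:15s} {w:+.3f}")
```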
Common failure modes include domain drift, low generalization under prompt-only or classifier-overfit approaches, surface question-type biases (attenuated in video QA), and challenges in modeling “partially” vs. “fully” answerable queries.
6. Applications and System Integration
AC modules are critical in applied and research settings:
- Q&A Platforms: Early screening for moderation, expert routing, and real-time improvement suggestions (Maity et al., 2017).
- Question Generation Evaluation: PMAN metric operationalizes answerability assessment for generated questions, correlating perfectly with human rank judgments across multiple models (Wang et al., 2023).
- Conversational Agents: Synthetic AC modules ensure that generated question–answer pairs in synthetic CQA data accurately reflect answerability, including triaging for “unknown” (Hwang et al., 2022).
- Semantic Document Analysis: AC serves as the measurement instrument for semantic coverage relations (equivalence, inclusion, overlap) between document pairs (Aperstein et al., 10 Sep 2025).
- LLM Hallucination Control: Activation-direction approaches provide knobs to steer abstention, mitigating overconfident but unsupported generations (Lavi et al., 26 Sep 2025); a steering sketch follows this list.
- Video QA: Automated AC is foundational for full-pipeline response generation under significant multi-modality and long-context constraints (Yang et al., 30 Jan 2024).
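As an illustration of the activation-direction steering mentioned in the hallucination-control item above, the sketch below adds a scaled unanswerability direction to a decoder layer's hidden states via a PyTorch forward hook. The model name, layer index, steering strength, and random direction are assumptions, and the hook pattern is a common implementation device rather than the exact procedure of the cited paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder decoder-only LM
LAYER, ALPHA = 16, 4.0                    # probed layer and steering strength (assumptions)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

d = torch.randn(model.config.hidden_size)  # stands in for a learned unanswerability direction
d = d / d.norm()


def steering_hook(module, inputs, output):
    """Add alpha * d to every token's hidden state leaving the chosen layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * d.to(hidden)
    return (steered, *output[1:]) if isinstance(output, tuple) else steered


handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Context: ...\nQuestion: ...\nAnswer:"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()   # always detach the hook so later calls are unsteered
```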
7. Open Challenges and Future Directions
Research on AC consistently notes unresolved issues:
- Partial/Graded Answerability: Most work frames AC as binary even though degrees of answerability (full, partial, none) are prevalent (Łajewska et al., 21 Jan 2024).
- Multimodal Reasoning: Accurate integration of vision and script (including alignment, summarization, and cross-modal entailment) remains insufficiently addressed (Yang et al., 30 Jan 2024).
- Robustness and Generalization: The direction-based approach (Lavi et al., 26 Sep 2025) generalizes better than discriminative classifiers, but further improvements (e.g., multi-layer, multi-position fusion) are open.
- Efficient, Faithful Evaluation: Black-box LLM metrics like PMAN (Wang et al., 2023) and zero-shot QA pipelines (Aperstein et al., 10 Sep 2025) are effective but expensive and sensitive to model/version drift.
- Explainability and Feedback: Most AC models predict but do not surface the minimal sufficient context or recommend clarifications; UI integration remains underexplored (Trienes et al., 2019).
- Dataset Limitations: Many gold or synthetic datasets are bounded by annotation or LLM accuracy ceilings—for document-level SCR, only 8.2% of contexts support 100% reliable AC (Aperstein et al., 10 Sep 2025).
A plausible implication is that further integration of external knowledge bases, memory-augmented or hierarchical architectures, and richer supervision (multi-label, evidence-citing, uncertainty-aware) will be required for robust, general AC in dynamic or open-ended settings. Cross-domain and cross-modal transfer, as well as explainable, interactive AC, remain active research directions.