Question Answerability Classification (AC)
- Question Answerability Classification is the task of determining whether a question is answerable from given evidence, typically framed as binary or multi-class prediction.
- Approaches range from classical feature-based models and similarity methods to neural transformers and LLM activation-space analyses, yielding notable accuracy and generalization improvements.
- Applications span Q&A platform moderation, expert routing, knowledge base evaluation, and LLM hallucination control in diverse multimodal and conversational settings.
Question Answerability Classification (AC) is a task that formalizes the prediction of whether a given question, within the context of a particular system or evidence set, is likely to be (fully) answerable. Variants arise across community Q&A, conversational QA, information-seeking dialogue, document analysis, open-domain, and multimodal video settings. While problem statements, datasets, and forms of context differ, the core is binary (or multi-class) supervised classification: given a question q and evidence e (which may be a passage, corpus, video, or even just the question’s linguistic content), predict a label, answerable or unanswerable. AC underpins moderation, question routing, data quality assessment, evaluation of knowledge bases, and hallucination avoidance in LLMs.
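As a concrete rendering of this shared formulation, the minimal sketch below defines the input/output contract of an AC system. The names (`ACExample`, `AnswerabilityClassifier`) are illustrative only and do not come from any of the cited works.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class ACExample:
    """A single answerability instance: question q plus evidence e."""
    question: str
    evidence: str                 # passage, transcript, corpus snippet, ...
    label: Optional[int] = None   # 1 = answerable, 0 = unanswerable


class AnswerabilityClassifier(ABC):
    """Common interface: score the pair (q, e), then threshold."""

    @abstractmethod
    def score(self, example: ACExample) -> float:
        """Return P(answerable | q, e) or any monotone answerability score."""

    def predict(self, example: ACExample, threshold: float = 0.5) -> int:
        return int(self.score(example) >= threshold)
```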
1. Formal Definitions and Problem Formulations
The mathematical formulation of AC is context-dependent:
- Community Q&A: Given a question $q$ and an observation window $\Delta t$, predict $y \in \{0, 1\}$, where $y = 1$ if $q$ receives at least one answer within $\Delta t$ (Quora) (Maity et al., 2017).
- Conversational QA: Given a pair $(q, s)$ of question and context sentence, predict $y = 1$ iff $q$ is answerable from $s$ (CoQA/QNLI sentence-level), extended hierarchically to passage and conversation histories (Hwang et al., 2022).
- Information-Seeking Dialogue: A classifier predicts at the sentence level whether a candidate snippet contains (part of) the answer; max/mean aggregation then produces passage-level and ranking-level scores (Łajewska et al., 21 Jan 2024).
- LLMs: For LLMs, AC reduces to deciding whether the given context $c$ supports an answer to $q$, approximated via activation-level scoring (Lavi et al., 26 Sep 2025): with hidden state $h$ at a chosen layer and an unanswerability direction $d$, define the score $s(q, c) = \langle h, d \rangle$ and predict unanswerable iff $s(q, c) > \tau$, for a threshold $\tau$ learned on dev data (see the sketch below).
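A minimal sketch of such activation-level scoring, assuming a Hugging Face decoder-only model. The difference-of-means direction used here is a simple stand-in for the steering-based direction selection of Lavi et al. (26 Sep 2025); the model name, layer index, and prompt format are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder decoder-only LM
LAYER = 16                               # probed layer index (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def hidden_state(question: str, context: str) -> torch.Tensor:
    """Hidden state h of the final prompt token at the probed layer."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]          # shape: (hidden_dim,)


def fit_direction(answerable, unanswerable) -> torch.Tensor:
    """Difference-of-means unanswerability direction from labelled dev pairs (q, c)."""
    mu_a = torch.stack([hidden_state(q, c) for q, c in answerable]).mean(0)
    mu_u = torch.stack([hidden_state(q, c) for q, c in unanswerable]).mean(0)
    d = mu_u - mu_a
    return d / d.norm()


def predict_unanswerable(question: str, context: str, d: torch.Tensor, tau: float) -> bool:
    """Predict 'unanswerable' iff the projection <h, d> exceeds threshold tau."""
    return float(hidden_state(question, context) @ d) > tau
```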
A table summarizing representative AC task settings:
| Domain | Input | Context Used | Label | Notable Formulation |
|---|---|---|---|---|
| Community Q&A | question | linguistic, meta | Answered / Open | SVM on Quora (Maity et al., 2017) |
| Info-Seeking Search | question | text corpus | Answerable / Unanswerable | BERT, passage–sentence aggregation (Łajewska et al., 21 Jan 2024) |
| LLMs (Extractive QA) | question | passage | Answerable / Unanswerable | Directional activation scoring (Lavi et al., 26 Sep 2025) |
| Video QA | question | video, script | Multiclass (5-way) | Llama2/SeViLA, timestamp anchoring (Yang et al., 30 Jan 2024) |
| QA-based SCR | question | document | Answerable / Unanswerable | LLM-prompted pipeline (Aperstein et al., 10 Sep 2025) |
2. Methodologies: Modeling, Features, Architectures
Approaches to AC span classical classifiers, deep neural networks, LLM prompts, and activation-space analysis:
- Feature-Based Supervised Learning: Early work, e.g. on Quora (Maity et al., 2017), utilizes SVMs and engineered feature vectors. Features include:
- Surface (length, OOV rate, $n$-gram presence)
- Syntax (POS tag diversity)
- Topic modeling (LDA topic membership & diversity)
- Edit/readability (ROUGE-LCS recall from original to edited question)
- Psycholinguistics (LIWC category fractions)
- Similarity/Nearest-Neighbor Models: For detection of unclear or unanswerable questions, retrieval-based features from similar historical questions are used. Key features include BM25 similarity sums, code block presence, length, and cosine similarity between clarification keyphrases (Trienes et al., 2019).
- Neural Architectures:
- Contextual Transformers: Sentence-level and passage-level AC via fine-tuned BERT/ALBERT encoders, using joint question–context inputs and a classification head (Hwang et al., 2022, Łajewska et al., 21 Jan 2024); see the sketch after this list.
- Multimodal Fusion: Video QA employs vision encoders (ViT-based or SeViLA) fused with transcript encoders; LLMs are augmented for 16k context tokens via rotary embeddings and FlashAttention (Yang et al., 30 Jan 2024).
- LLM Black-Box Evaluation: Zero-shot or prompt-based answerability verdicts (e.g., does the model extract an answer, or does a “JUDGE” LLM assert semantic equivalence) (Wang et al., 2023, Aperstein et al., 10 Sep 2025).
- Activation-Space Methods (LLMs): Identification of a linear direction $d$ in activation space whose projection $\langle h, d \rangle$ separates answerable from unanswerable inputs, with candidate directions ranked by their ability to steer model output toward abstention (measured via the log-odds of an “unanswerable” token) (Lavi et al., 26 Sep 2025).
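The sentence-level transformer formulation and its max/mean aggregation can be sketched as follows, assuming Hugging Face `transformers` and a binary classification head; the checkpoint name is a placeholder standing in for a fine-tuned AC model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "bert-base-uncased"  # placeholder; a fine-tuned AC checkpoint in practice

tok = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()


@torch.no_grad()
def sentence_scores(question: str, sentences: list[str]) -> torch.Tensor:
    """P(answerable) for each joint (question, sentence) input pair."""
    batch = tok([question] * len(sentences), sentences,
                padding=True, truncation=True, return_tensors="pt")
    logits = model(**batch).logits                      # shape: (n_sents, 2)
    return torch.softmax(logits, dim=-1)[:, 1]          # prob of class "answerable"


def passage_score(question: str, sentences: list[str], agg: str = "max") -> float:
    """Aggregate sentence-level probabilities to a passage-level verdict."""
    scores = sentence_scores(question, sentences)
    return float(scores.max() if agg == "max" else scores.mean())


# Usage:
# passage_score("Who wrote Dune?",
#               ["Dune is a 1965 novel.", "It was written by Frank Herbert."])
```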
3. Datasets and Labeling Strategies
Datasets underpinning AC research are characterized by their granularity, label rigor, and connection to downstream tasks:
Community Q&A
- Quora: 822,040 questions, 1.8M answers, labels based on time-to-first-answer (1 or 3 months) (Maity et al., 2017).
- Stack Exchange: Multiple sites (SO, SuperUser, AskUbuntu, etc.) with clear/unclear labels inferred by comment and edit heuristics (Trienes et al., 2019).
Information-Seeking and Conversational Search
- CAsT-Answerability: Sentence, passage, and ranking-level annotations derived from TREC CAsT, based on nugget overlap, passage inclusion, and combinatorial ranking (Łajewska et al., 21 Jan 2024).
- CoQA/QNLI-based: AC modules trained on sentence-level labeled Q–A pairs, incorporating both single-turn and multi-turn conversational data (Hwang et al., 2022).
LLM-Based and Document-Level
- Synthetic SQuAD Variants: For semantic coverage relation (SCR) analysis, answerability is controlled by surgical information removal across paraphrased documents, yielding pairs with guaranteed label correctness (Aperstein et al., 10 Sep 2025).
- RepLiQA, NQ, MuSiQue, SQuAD2.0: Balanced sets of answerable/unanswerable pairs for LLM activation-analysis studies (Lavi et al., 26 Sep 2025); a labeling sketch appears at the end of this section.
Multimodal/Video QA
- YTCommentQA: 2,332 user-generated questions across 2,004 YouTube videos, labeled by human annotators after timestamp verification, script/visual examination, and consensus (Yang et al., 30 Jan 2024).
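To make the labeling conventions concrete, the sketch below derives binary answerability labels from SQuAD 2.0 (empty gold-answer list ⇒ unanswerable) using the Hugging Face `datasets` library; the balancing step is a simplification, not the exact protocol of the cited studies.

```python
import random
from datasets import load_dataset

squad = load_dataset("squad_v2", split="train")


def to_ac_example(row):
    """SQuAD 2.0 marks unanswerable questions with an empty gold-answer list."""
    return {
        "question": row["question"],
        "evidence": row["context"],
        "label": int(len(row["answers"]["text"]) > 0),   # 1 = answerable
    }


examples = [to_ac_example(r) for r in squad]
pos = [e for e in examples if e["label"] == 1]
neg = [e for e in examples if e["label"] == 0]

# Simple balanced subsample (the cited studies use their own splits).
k = min(len(pos), len(neg))
balanced = random.sample(pos, k) + random.sample(neg, k)
random.shuffle(balanced)
```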
4. Evaluation Metrics and Results
Performance is measured using standard classification metrics; implementation choices impact interpretability and generalization:
- Community Q&A Models (Maity et al., 2017):
- SVM (1 month): Accuracy = 76.26%, Precision = 0.763, ROC AUC ≈ 0.762.
- Linguistic features alone: Accuracy = 74.18%.
- Unclear/Answerability Entry Points (Trienes et al., 2019, Łajewska et al., 21 Jan 2024):
- F1 (unclear/unanswerable class): CNN/BOW-LogReg ≈ 0.79.
- BERT (sentence, passage, ranking): 0.752, 0.634 (max agg), 0.891 (ranking mean agg).
- LLM Directions and Baselines (Lavi et al., 26 Sep 2025):
- Unanswerable recall (same dataset): classifier ≈ 87%, direction ≈ 83–86%.
- Cross-dataset F1: direction drops only ~7.4%, classifier ~30.2%.
- Synthetic QA Evaluation (Wang et al., 2023, Aperstein et al., 10 Sep 2025):
- PMAN (automatic metric): Accuracy on manually vetted non-yes/no questions: 0.94, F1 (answerable/unanswerable) ≈ 0.91–0.94.
- SCR AC datasets ensure 100% label fidelity by construction.
- Video QA (Yang et al., 30 Jan 2024):
- Binary segment-level F1: Llama-2 (13B): 55.49, SeViLA: 46.55.
- Video-level multiclass accuracy: Llama-2 (13B): 37.70%, SeViLA: 35.27%; basic heuristics perform substantially worse.
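The headline numbers above correspond to standard classification metrics; a minimal scikit-learn sketch (with placeholder predictions, not data from the cited studies) shows how they are computed.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder predictions: y_true / y_score would come from a trained AC model.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]            # 1 = answerable, 0 = unanswerable
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 (unanswerable class):", f1_score(y_true, y_pred, pos_label=0))
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("ROC AUC:", roc_auc_score(y_true, y_score))
```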
A table of select results:
| Setting | Method/Baseline | Top Metric(s) | Value |
|---|---|---|---|
| Quora (1 mo) | SVM (full features) | Accuracy | 76.26% |
| Quora (1 mo) | Linguistic only | Accuracy | 74.18% |
| StackEx (unclear det.) | BoW LR / CNN | F1 (unclear) | ≈0.79 |
| CAsT (ranking AC) | BERT (max/mean agg.) | Accuracy | 0.891 |
| LLM Direction (cross-ds) | Direction (calibrated) | Macro-F1 / Recall | up to 11.9% over classifier |
| YTCommentQA (binary seg) | Llama-2 (13B) | F1 | 55.49 |
5. Feature Importance and Model Analysis
Comprehensive feature ablation and coefficient analyses across multiple works reveal:
- Language Use: LIWC psycholinguistic features, POS tag diversity, and ROUGE-LCS all strongly correlate with answerability (Maity et al., 2017); a coefficient-inspection sketch follows this list.
- Structural/Behavioral Factors: Edit counts, promotion behavior, and topic assignment changes are informative (Maity et al., 2017).
- Similarity-derived Cues: The need for clarifying information among similar questions (BM25+keyphrase statistics) robustly predicts unclear/unanswerable status (Trienes et al., 2019).
- Neural/Activation: Linear directions in LLM activation space allow identification of an “unanswerable” subspace without retraining, capturing abstention cues generalizable across datasets (Lavi et al., 26 Sep 2025).
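A rough sketch of the coefficient-style analysis behind such findings, assuming a linear model over a named feature matrix; logistic regression is used here as a stand-in for the SVMs of the cited work, and the feature names and data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder feature matrix: rows = questions, columns = engineered features.
feature_names = ["length", "oov_rate", "pos_diversity", "rouge_lcs", "liwc_cognitive"]
X = np.random.rand(200, len(feature_names))          # stand-in for real features
y = np.random.randint(0, 2, size=200)                # 1 = answered, 0 = open

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by absolute coefficient magnitude (a rough importance proxy).
ranked = sorted(zip(feature_names, clf.coef_[0]), key=lambda t: abs(t[1]), reverse=True)
for name, w in ranked:
    print(f"{name:15s} {w:+.3f}")
```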
Common failure modes include domain drift, low generalization under prompt-only or classifier-overfit approaches, surface question-type biases (attenuated in video QA), and challenges in modeling “partially” vs. “fully” answerable queries.
6. Applications and System Integration
AC modules are critical in applied and research settings:
- Q&A Platforms: Early screening for moderation, expert routing, and real-time improvement suggestions (Maity et al., 2017).
- Question Generation Evaluation: PMAN metric operationalizes answerability assessment for generated questions, correlating perfectly with human rank judgments across multiple models (Wang et al., 2023).
- Conversational Agents: Synthetic AC modules ensure that generated question–answer pairs in synthetic CQA data accurately reflect answerability, including triaging for “unknown” (Hwang et al., 2022).
- Semantic Document Analysis: AC serves as the measurement instrument for semantic coverage relations (equivalence, inclusion, overlap) between document pairs (Aperstein et al., 10 Sep 2025).
- LLM Hallucination Control: Activation-direction approaches provide knobs to steer abstention, mitigating overconfident but unsupported generations (Lavi et al., 26 Sep 2025); a steering sketch follows this list.
- Video QA: Automated AC is foundational for full-pipeline response generation under significant multi-modality and long-context constraints (Yang et al., 30 Jan 2024).
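As an illustration of the activation-direction steering mentioned in the hallucination-control item above, the sketch below adds a scaled unanswerability direction to a decoder layer's hidden states via a PyTorch forward hook. The model name, layer index, steering strength, and random direction are assumptions, and the hook pattern is a common implementation device rather than the exact procedure of the cited paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder decoder-only LM
LAYER, ALPHA = 16, 4.0                    # probed layer and steering strength (assumptions)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

d = torch.randn(model.config.hidden_size)  # stands in for a learned unanswerability direction
d = d / d.norm()


def steering_hook(module, inputs, output):
    """Add alpha * d to every token's hidden state leaving the chosen layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * d.to(hidden)
    return (steered, *output[1:]) if isinstance(output, tuple) else steered


handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Context: ...\nQuestion: ...\nAnswer:"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()   # always detach the hook so later calls are unsteered
```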
7. Open Challenges and Future Directions
Research on AC consistently notes unresolved issues:
- Partial/Graded Answerability: Most work frames AC as binary even though degrees of answerability (full, partial, none) are prevalent (Łajewska et al., 21 Jan 2024).
- Multimodal Reasoning: Accurate integration of vision and script (including alignment, summarization, and cross-modal entailment) remains insufficiently addressed (Yang et al., 30 Jan 2024).
- Robustness and Generalization: The direction-based approach (Lavi et al., 26 Sep 2025) generalizes better than discriminative classifiers, but further improvements (e.g., multi-layer, multi-position fusion) are open.
- Efficient, Faithful Evaluation: Black-box LLM metrics like PMAN (Wang et al., 2023) and zero-shot QA pipelines (Aperstein et al., 10 Sep 2025) are effective but expensive and sensitive to model/version drift.
- Explainability and Feedback: Most AC models predict but do not surface the minimal sufficient context or recommend clarifications; UI integration remains underexplored (Trienes et al., 2019).
- Dataset Limitations: Many gold or synthetic datasets are bounded by annotation or LLM accuracy ceilings—for document-level SCR, only 8.2% of contexts support 100% reliable AC (Aperstein et al., 10 Sep 2025).
A plausible implication is that further integration of external knowledge bases, memory-augmented or hierarchical architectures, and richer supervision (multi-label, evidence-citing, uncertainty-aware) will be required for robust, general AC in dynamic or open-ended settings. Cross-domain and cross-modal transfer, as well as explainable, interactive AC, remain active research directions.