UnsolvableQA: Unanswerable Question Framework
- UnsolvableQA is a framework that categorizes questions as unsolvable when they contain inherent contradictions, missing context, or ill-posed premises.
- Benchmark creation leverages programmatic generation, reverse-engineered solution chains, and synthetic context augmentation to ensure challenging unsolvable instances.
- Empirical results demonstrate improved detection of unsolvable cases and calibration of model responses, reducing overconfidence and faulty answer generation.
UnsolvableQA concerns the rigorous identification, synthesis, and evaluation of problems deliberately constructed to be unsolvable by a reasoning or question-answering (QA) system, as well as the design of training and evaluation methodologies that incentivize reliable detection and appropriate abstention. Central to the field is aligning LLMs and other automated systems not only to solve feasible tasks but also to reliably recognize intrinsic contradictions, missing information, or problems that exceed their capacity, thereby mitigating hallucination and overconfidence on intractable instances (Peng et al., 1 Dec 2025). The UnsolvableQA paradigm now underpins research across reading comprehension, multi-modal reasoning, knowledge base question answering, and foundational AGI evaluation.
1. Formal Definition and Taxonomy of Unsolvability
A problem is classified as unsolvable if, under the assumptions of the underlying domain (e.g., mathematics, logic, KB coverage, or factual completeness), no coherent solution can be constructed, or the solution is inherently unknowable. Unsolvability arises in several forms:
- Objective contradiction: The question or problem statement encodes mutually exclusive or axiom-violating constraints (e.g., "find an integer simultaneously greater and less than zero") (Peng et al., 1 Dec 2025).
- Schema or data incompleteness: In knowledge base QA, unsolvability results from type, relation, or entity omission causing either (i) no valid logical form to exist or (ii) logical form execution yielding an empty answer (Patidar et al., 2022).
- Information absence or retrieval miss: In information-seeking settings, unanswerability arises when the context or evidence lacks the necessary information, or a question is posed with false or ambiguous premises (Asai et al., 2020).
- Ambiguity or ill-posedness: Some questions, especially in open-domain or philosophical settings, are unanswerable due to multiple interpretations or lack of a definitive ground truth (Agarwal et al., 2023).
- Beyond capacity: A distinction is made between objective unsolvability and problems that are solvable in principle but exceed model capability; reliable systems must prudently abstain in both cases (Peng et al., 1 Dec 2025).
These categories are operationalized via concrete annotations in benchmarks such as SQuAD 2.0 (adversarially crowd-written unanswerable questions) (Rajpurkar et al., 2018), GrailQAbility (taxonomy of KB incompleteness) (Patidar et al., 2022), and multi-modal sets like VisionTrap (contradictory logic or physically impossible image-question pairs) (Saadat et al., 23 Jul 2025).
2. Construction of UnsolvableQA Benchmarks
UnsolvableQA datasets are developed using multiple methodologies tailored to domain characteristics:
- Programmatic generation for logic and combinatorics: For domains with tractable decision procedures, unsolvable instances are algorithmically verified by exhaustive search, SAT/CSP solvers, or explicit bottleneck injection. E.g., unsolvable Game24 and Hamiltonian Cycle problems in UnsolvableQA are generated and confirmed by exhaustive rational-arithmetic search or SAT solving (Peng et al., 1 Dec 2025); a minimal verification sketch follows this list.
- Reverse construction for mathematics: A solvable problem's solution chain (CoT) is corrupted by injecting a logical or axiom-violating contradiction, and the problem statement is reverse-engineered so that the corrupted CoT would "solve" it, followed by verification that a strong model cannot produce a solution (Peng et al., 1 Dec 2025).
- Synthetic context re-matching and augmentation: Automated pipelines (AGent) create unanswerables by pairing questions with contexts that lack the appropriate answers, using TF-IDF similarity and adversarial/plausibility filtering to maximize difficulty (Tran et al., 2023); a re-matching sketch appears at the end of this section.
- Lexico-semantic perturbation: Antonym swaps and entity swaps generate diverse, high-fluency unanswerables that retain surface similarity to answerable questions, challenging models that rely on token overlap (Gautam et al., 2023).
- Curated open-domain harvesting: Community-sourced unsolved problems ("UQ") are filtered by rules (e.g., engagement thresholds), LLM judges, and human experts to ensure both difficulty and real-world value, used in asynchronous evaluation platforms (Nie et al., 25 Aug 2025).
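To make the programmatic-generation step concrete, the following is a minimal sketch of an exhaustive rational-arithmetic check for Game24 solvability; the function name and instance format are illustrative, not drawn from the benchmark's released code. An instance would be labeled unsolvable only when this search fails.

```python
from fractions import Fraction
from itertools import permutations

def game24_solvable(numbers, target=24):
    """Exhaustively check whether the given numbers can reach the target
    using +, -, *, / with exact rational arithmetic (no rounding error)."""
    def search(values):
        if len(values) == 1:
            return values[0] == target
        # Pick any ordered pair, combine it with each operator, recurse on the rest.
        for i, j in permutations(range(len(values)), 2):
            rest = [values[k] for k in range(len(values)) if k not in (i, j)]
            a, b = values[i], values[j]
            candidates = [a + b, a - b, a * b]
            if b != 0:
                candidates.append(a / b)
            for c in candidates:
                if search(rest + [c]):
                    return True
        return False
    return search([Fraction(n) for n in numbers])

print(game24_solvable([4, 6, 1, 1]))   # True: 4 * 6 * 1 * 1 = 24
print(game24_solvable([1, 1, 1, 1]))   # False: labeled unsolvable
```

Exact Fraction arithmetic avoids false positives or negatives from floating-point rounding, which matters when unsolvability labels must be independently verifiable.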
Empirical datasets often enforce a close pairing of solvable and unsolvable instances to facilitate robust decision boundary learning, with controlled test splits and systematic coverage across domains (e.g., six domain splits in (Peng et al., 1 Dec 2025), fine-grained unsolvability types in (Patidar et al., 2022), and image-based logical impossibilities in (Saadat et al., 23 Jul 2025)).
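The synthetic context re-matching step can be sketched as follows, assuming a simple TF-IDF retriever and an answer-absence check; the function, field names, and similarity threshold are illustrative assumptions rather than the AGent pipeline's actual implementation, which additionally applies adversarial/plausibility filtering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rematch_contexts(questions, gold_answers, contexts, min_sim=0.3):
    """For each question, pick the most similar context that does NOT
    contain its gold answer, yielding a plausible but unanswerable pair."""
    vec = TfidfVectorizer(stop_words="english")
    ctx_mat = vec.fit_transform(contexts)
    q_mat = vec.transform(questions)
    sims = cosine_similarity(q_mat, ctx_mat)   # shape: (num_questions, num_contexts)

    pairs = []
    for qi, question in enumerate(questions):
        ranked = sims[qi].argsort()[::-1]      # most similar contexts first
        for ci in ranked:
            similar_enough = sims[qi, ci] >= min_sim
            answer_absent = gold_answers[qi].lower() not in contexts[ci].lower()
            if similar_enough and answer_absent:
                pairs.append({"question": question,
                              "context": contexts[ci],
                              "label": "unanswerable"})
                break
    return pairs
```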
3. Model Architectures and Training Objectives
UnsolvableQA work has motivated model and training innovations for robust three-way decision making (solve, reject as unsolvable, or abstain as beyond capacity):
- Composite reward frameworks: The UnsolvableRL framework adopts Group-Relative Policy Optimization with a three-component reward that penalizes hallucinated answers on unsolvable problems, rewards abstention when appropriate, and directly optimizes answer accuracy (Peng et al., 1 Dec 2025); see the schematic sketch below.
- Unified span and abstention architectures: MRC models integrate span prediction and answerability classification; see null-span/BCE approaches in SQuAD 2.0 (Rajpurkar et al., 2018), U-Net's universal node for fused question-passage encoding (Sun et al., 2018), and explicit "no logical form" targets in KBQA (Patidar et al., 2022).
- Iterative feedback with verifiers: FUn-FuSIC for KBQA employs multi-pass LLM reasoning, weak verifiers (e.g., semantic back-translation, type-checking), and customized self-consistency for unanswerability, to determine whether an empty answer reflects genuine unsolvability rather than a faulty logical form (Sawhney et al., 20 Jun 2024).
- Calibration via adaptive incentive schedules: Target accuracies that increase over training regularize refusal behavior, avoiding degenerate "always answer" or "always abstain" policies (Peng et al., 1 Dec 2025).
Novelty in these formulations lies in explicitly optimizing against overconfidence and gradient interference (capability collapse), together with careful calibration of negative signals, as established by both theoretical and empirical ablations (Peng et al., 1 Dec 2025).
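The composite reward described above can be illustrated schematically as below; the abstention marker, weights, and answer checker are assumptions for exposition, not the exact UnsolvableRL formulation. In a GRPO-style setup, such per-response rewards would be normalized within each sampled group before the policy update.

```python
def composite_reward(response, problem, w_acc=1.0, w_abstain=1.0, w_hall=1.0):
    """Score one sampled response for group-relative policy optimization.
    Assumes responses signal abstention with a leading 'unsolvable' marker."""
    abstained = response.strip().lower().startswith("unsolvable")

    if problem["solvable"]:
        if abstained:
            return 0.0                       # no credit for refusing a solvable task
        correct = response_matches(response, problem["answer"])
        return w_acc if correct else 0.0     # accuracy term
    else:
        if abstained:
            return w_abstain                 # reward correct rejection
        return -w_hall                       # penalize hallucinated answers

def response_matches(response, answer):
    """Placeholder answer checker; a real verifier would be task-specific."""
    return str(answer) in response
```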
4. Evaluation Metrics and Empirical Results
UnsolvableQA research relies on both standard and custom metrics to examine detection and calibration (a computation sketch follows the table):
| Metric | Purpose | Example Tasks |
|---|---|---|
| EM/F1 (all/split) | Exact or token match | SQuAD 2.0, GrailQAbility |
| Unanswerable EM/F1 | Correct abstentions/rejects | Logic, KBQA, VQA |
| Pass rate / Verified rate | Fraction passing validator, fraction verified correct | UQ Platform (Nie et al., 25 Aug 2025) |
| Calibration measures | Accuracy at fixed threshold, abstention rate | VisionTrap, Impossible Test |
| Regression/statistical analysis | Uncertainty as function of problem difficulty or category | Impossible Test (Noever et al., 20 Nov 2024) |
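A minimal sketch of the abstention-oriented metrics above (unanswerable EM as the correct-rejection rate, plus overall abstention rate); the field names are illustrative assumptions.

```python
def abstention_metrics(examples):
    """examples: dicts with boolean gold_unanswerable / predicted_abstain and,
    for answerable items, boolean answer_correct."""
    unans = [e for e in examples if e["gold_unanswerable"]]
    ans = [e for e in examples if not e["gold_unanswerable"]]

    unanswerable_em = sum(e["predicted_abstain"] for e in unans) / max(len(unans), 1)
    answerable_em = sum((not e["predicted_abstain"]) and e["answer_correct"]
                        for e in ans) / max(len(ans), 1)
    abstention_rate = sum(e["predicted_abstain"] for e in examples) / max(len(examples), 1)

    return {"unanswerable_em": unanswerable_em,
            "answerable_em": answerable_em,
            "abstention_rate": abstention_rate}
```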
Key empirical findings:
- Large models fine-tuned on UnsolvableQA exhibit substantial improvements: unsolvable-instance detection rises from 36.3% to 90.9% under UnsolvableRL, and Game24 rejection accuracy improves from 17% to 99% (Peng et al., 1 Dec 2025).
- Without explicit exposure to unsolvable items, smaller models and LLMs exhibit "capability collapse," with unsolvability rejection accuracy approaching zero under continued training (Peng et al., 1 Dec 2025).
- In VQA, both failure to abstain and over-abstention are common; e.g., LLaVA abstains 97% of the time when options are not presented, while Gemini Flash abstains only 29% of the time on impossible questions (Saadat et al., 23 Jul 2025).
- On community-sourced unsolved problems (UQ), top models pass validation on only 15% of questions; human verification shows only a handful of genuine breakthroughs (Nie et al., 25 Aug 2025).
- On epistemic humility ("Impossible Test"), even the best-performing models (Gemini, Claude) admit ignorance on roughly 69% of unsolvable items; performance is significantly lower in the invention and NP-hardness categories (Noever et al., 20 Nov 2024).
5. Failure Modes and Theoretical Underpinnings
Failures in UnsolvableQA systems consistently manifest either as overconfident hallucination (proposing plausible but incorrect answers) or as degenerate abstention (universal rejection):
- Gradient interference: Training only on solvable problems suppresses refusal logits on unsolvable instances, destroying the boundary required to distinguish contradictions (Peng et al., 1 Dec 2025).
- Verifier/generation mismatch: In iterative KBQA, naive execution-guided feedback treats all empty answers as fixable, failing to recognize schema-level unsolvability. Adding weak verifiers and back-translation feedback is essential (Sawhney et al., 20 Jun 2024); a schematic verifier loop is sketched below.
- Heuristic bias: VLMs in VisionTrap exhibit format sensitivity (e.g., guessing when answer options are presented but over-abstaining when they are withheld), reflecting underdeveloped uncertainty estimation (Saadat et al., 23 Jul 2025).
- Data-augmentation limitations: Entity and antonym swaps can still produce answerable cases (≈20% noise), reducing selectivity; training skewed purely toward answerable examples produces models highly vulnerable to spurious context-answer overlap (Gautam et al., 2023).
Gradient and statistical analyses expose the necessity of negative examples—unsolvable or out-of-capacity cases—to defend against both types of failure (Peng et al., 1 Dec 2025, Gautam et al., 2023).
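The verifier-in-the-loop remedy for the empty-answer failure mode can be sketched schematically as below; the KB and LLM interfaces are hypothetical placeholders for exposition, not the FUn-FuSIC authors' code.

```python
def answer_with_verifiers(question, kb, llm, max_rounds=3):
    """Treat an empty execution result as unanswerable only after weak verifiers pass."""
    feedback = None
    for _ in range(max_rounds):
        logical_form = llm.generate_logical_form(question, feedback=feedback)
        # Weak verifiers: schema type-checking and semantic back-translation.
        issues = []
        if not kb.type_checks(logical_form):
            issues.append("type mismatch against the KB schema")
        if not llm.back_translation_matches(question, logical_form):
            issues.append("back-translation diverges from the question")
        if issues:
            feedback = "; ".join(issues)     # retry generation with verifier feedback
            continue
        result = kb.execute(logical_form)
        if result:                           # non-empty answer: return it
            return {"answer": result, "status": "answered"}
        # Empty result despite all verifiers passing: genuine unsolvability.
        return {"answer": None, "status": "unanswerable"}
    return {"answer": None, "status": "unanswerable"}
```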
6. Implications for Alignment, AGI Evaluation, and Future Research
UnsolvableQA techniques now inform core recommendations for model reliability, alignment, and AGI evaluation:
- Objective unsolvability exposure: Only through exposure to synthetically constructed, intrinsically contradictory problems (not merely out-of-scope ones) can models avoid hallucination and capability collapse (Peng et al., 1 Dec 2025, Gautam et al., 2023).
- Calibration-by-design: Incentivizing abstention in proportion to batch accuracy and penalizing false rejections, as in adaptive refusal schedules, prevents both reckless answering and universal refusal (Peng et al., 1 Dec 2025); a toy schedule is sketched below.
- Epistemic humility metrics: AGI tests now incorporate "accuracy of uncertainty acknowledgment" (e.g., proportion of "I don't know" on Impossible Test), providing a quantifiable complement to standard accuracy (Noever et al., 20 Nov 2024).
- Verifier-driven pipelines: As in the UQ platform, community-driven, validator-assisted, and human-verified workflows are leveraged to track the genuine advancement of frontier models on unsolved problems (Nie et al., 25 Aug 2025).
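The adaptive refusal schedule mentioned above can be illustrated with a toy weight function that scales the abstention reward by batch accuracy against a target that ramps up over training; this is an illustrative sketch under those assumptions, not the published schedule.

```python
def abstention_reward_weight(batch_accuracy, step, total_steps,
                             start_target=0.3, end_target=0.9):
    """Weight on the abstention reward: proportional to batch accuracy,
    measured against a target accuracy that increases linearly over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    target = start_target + frac * (end_target - start_target)
    return min(batch_accuracy / target, 1.0)   # capped at full weight
```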
Future research targets include scaling unsolvability detection to open-ended language, multi-hop reasoning, more diverse domains, and strategic negative data synthesis for global model calibration. Deeper theoretical work is warranted on gradient interference in underparameterized regimes, uncertainty estimation, and generalization of UnsolvableQA paradigms to multi-agent and embodied AI settings (Peng et al., 1 Dec 2025, Sawhney et al., 20 Jun 2024, Noever et al., 20 Nov 2024).