SelfAware Dataset: LLM Uncertainty Benchmark
- SelfAware is a comprehensive benchmark that evaluates LLMs' capacity to recognize unanswerable questions and express uncertainty.
- It systematically pairs semantically similar answerable and unanswerable questions, leveraging metrics like F1 score to assess self-knowledge.
- Empirical findings highlight that instruction tuning and prompt engineering significantly boost model performance, though a gap with human benchmarks remains.
The SelfAware dataset is a benchmark for evaluating large language models' (LLMs') capacity to recognize when posed questions are unanswerable, thereby quantifying a crucial aspect of model self-knowledge and epistemic uncertainty. Introduced by Yin et al. (Yin et al., 2023), SelfAware systematically distinguishes between answerable and unanswerable questions, enabling the assessment of a model's ability to express uncertainty rather than generate incorrect but plausible responses. This capability is foundational for robust deployment of LLMs, particularly in sensitive or high-assurance settings.
1. Motivation and Conceptual Framework
The SelfAware dataset directly addresses the generalization and reliability gaps observed in prior work on LLM self-knowledge. While LLMs demonstrate extensive factual recall and reasoning in context-rich QA benchmarks, they often fail to recognize epistemic boundaries and, in practice, tend to generate confident yet erroneous answers to inherently unanswerable prompts. Previous datasets such as SQuAD 2.0 and NewsQA include unanswerable questions, but these are context-dependent: the questions may become answerable given additional information, which limits their utility for open-domain uncertainty estimation. Small-scale, ad-hoc compilations proved insufficiently broad. SelfAware instead provides unanswerable questions systematically paired with semantically matched answerable ones, allowing controlled evaluation of binary self-knowledge: the positive class is genuine unanswerability, while the negative class is routine answerable QA (Yin et al., 2023).
2. Dataset Composition and Category Taxonomy
SelfAware consists of 3,369 questions, partitioned into 1,032 unanswerable and 2,337 answerable examples. Unanswerable questions are manually categorized into five types, each designed to reflect a distinct source of epistemic uncertainty:
| Category | Prevalence (%) | Example |
|---|---|---|
| No Scientific Consensus | ≈25 | "Are we alone in the universe, or will we discover alien life at some point?" |
| Imagination | ≈15 | "What will the fastest form of transportation be in 2050?" |
| Completely Subjective | ≈27 | "Would you rather be shot into space or explore the deepest depths of the sea?" |
| Too Many Variables | ≈10 | "John made $6 mowing lawns and $18 weed-eating. If he only spent $3 or $5 a week, how long would the money last him?" |
| Philosophical | ≈23 | "How come god was born from nothingness?" |
Answerable questions are drawn from SQuAD, HotpotQA, and TriviaQA, with semantically similar matches selected for each unanswerable example using SimCSE embeddings, providing matched answerable controls for evaluation. This pairing ensures that the answerable controls share context and topic with the corresponding unanswerable items (Yin et al., 2023).
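As an illustration of this pairing step, the sketch below selects the closest answerable candidate for each unanswerable question by embedding similarity. It is a minimal sketch under stated assumptions: a generic sentence-transformer encoder stands in for SimCSE, and the question lists are tiny illustrative examples rather than the actual pools used by Yin et al. (2023).

```python
# Sketch: pair each unanswerable question with its closest answerable
# candidate by embedding similarity. The encoder and candidate pool are
# illustrative stand-ins (the paper uses SimCSE), not the exact setup
# of Yin et al. (2023).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in encoder

unanswerable = [
    "Are we alone in the universe, or will we discover alien life at some point?",
    "What will the fastest form of transportation be in 2050?",
]
answerable_pool = [  # hypothetical candidates from SQuAD/HotpotQA/TriviaQA
    "When was the SETI Institute founded?",
    "What is the top speed of the Shanghai maglev train?",
    "Who wrote 'The War of the Worlds'?",
]

# Encode and L2-normalize so that dot products equal cosine similarities.
u_emb = model.encode(unanswerable, normalize_embeddings=True)
a_emb = model.encode(answerable_pool, normalize_embeddings=True)

sims = u_emb @ a_emb.T        # shape: (num_unanswerable, num_candidates)
best = sims.argmax(axis=1)    # closest answerable match per unanswerable question

for i, q in enumerate(unanswerable):
    print(f"{q!r} -> {answerable_pool[best[i]]!r} (cosine={sims[i, best[i]]:.2f})")
```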
3. Construction Methodology and Validation
Unanswerable questions (2,858 candidates) were sampled from real-world Q&A forums (Quora, HowStuffWorks) and independently evaluated by three human experts with access to search engines. Only questions unanimously judged unanswerable (taking into account world knowledge as of annotation time) were retained, yielding 1,032 questions. For each, answerable counterparts were selected by semantic similarity from a corpus pool exceeding 100,000 questions; inter-annotator agreement was high, and matched pairs were manually verified for plausibility. Answerable examples were sourced as follows: SQuAD (1,487), HotpotQA (182), TriviaQA (668). This dual-layered validation establishes both the epistemic validity and the semantic proximity critical for benchmark integrity (Yin et al., 2023).
4. Evaluation Protocol and Metrics
The primary evaluation criterion is a model’s ability to express uncertainty in response to unanswerable questions while correctly answering the answerable ones. The metric suite includes:
- Uncertainty Detection Method: Model outputs are scored as "uncertain" if their textual content is sufficiently similar (via SimCSE embedding similarity against a fixed threshold) to any phrase in a curated uncertainty set (e.g., “The answer is unknown.”, “I’m not sure.”). Sliding windows of five sentences address response-length variability. This method achieves an F1 ≈ 91.9% (on held-out validation), balancing recall and precision for uncertainty detection; a simplified sketch of the procedure appears at the end of this section.
- Primary Metric (Self-Knowledge F1): Positive cases are unanswerable questions, with
$$
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
$$
where precision is the ratio of true positives (unanswerable questions correctly flagged as uncertain) to all flagged positives, and recall is the ratio of true positives to all actual positive cases.
- Accuracy for Answerable QA: Fraction of correctly answered negative-class (answerable) questions.
The evaluation also supports controlled experiments on the influence of input formats, e.g., direct prompt vs. explicit instruction vs. in-context examples (Yin et al., 2023).
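The following is a simplified sketch of this scoring pipeline, assuming a generic sentence-embedding model in place of SimCSE; the uncertainty phrases beyond those quoted above, the similarity threshold value, and the naive sentence splitting are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: flag a model response as "uncertain" if any sliding window of
# sentences is sufficiently similar to a reference uncertainty phrase,
# then compute self-knowledge F1 over unanswerable (positive) questions.
# Encoder, threshold, and extra phrases are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in for SimCSE
UNCERTAIN_PHRASES = ["The answer is unknown.", "I'm not sure.", "It is impossible to know."]
THRESHOLD = 0.75   # illustrative value, not the paper's tuned threshold
WINDOW = 5         # sliding window of five sentences, as in the paper

ref_emb = model.encode(UNCERTAIN_PHRASES, normalize_embeddings=True)

def is_uncertain(response: str) -> bool:
    """Return True if any 5-sentence window matches an uncertainty phrase."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]  # naive sentence split
    windows = [". ".join(sentences[i:i + WINDOW])
               for i in range(max(1, len(sentences) - WINDOW + 1))]
    win_emb = model.encode(windows, normalize_embeddings=True)
    return bool((win_emb @ ref_emb.T).max() >= THRESHOLD)

def self_knowledge_f1(responses: list[str], is_unanswerable: list[bool]) -> float:
    """Positives are unanswerable questions; a hit is a response flagged uncertain."""
    flagged = [is_uncertain(r) for r in responses]
    tp = sum(f and u for f, u in zip(flagged, is_unanswerable))
    fp = sum(f and not u for f, u in zip(flagged, is_unanswerable))
    fn = sum((not f) and u for f, u in zip(flagged, is_unanswerable))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Accuracy on the answerable (negative-class) questions is computed separately by comparing generated answers against gold references.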
5. Empirical Results and Key Findings
SelfAware was used to benchmark 20 LLMs, including GPT-3, InstructGPT, LLaMA, Alpaca, Vicuna, and GPT-4. The results reveal several core patterns:
- Self-knowledge (as measured by F1) increases with model scale and instruction tuning. InstructGPT outperforms base GPT-3; text-davinci is superior to davinci; Vicuna-13B surpasses LLaMA-65B, indicating benefits from fine-tuning and instruction-following.
- Prompt engineering materially impacts performance. Models presented with in-context examples and an explicit instruction to signal uncertainty achieve up to a 28% F1 gain over direct prompts (illustrative prompt formats are sketched after this list).
- The highest F1 (75.47%) on unanswerables is achieved by GPT-4 in instruction mode, still trailing the human benchmark (84.93%) measured on a 100-question subset.
- Accuracy on answerable questions also scales with model sophistication, ranging from approximately 2.5% (text-ada-001) to 42.6% (GPT-4) (Yin et al., 2023).
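To make the three input formats compared above concrete (direct prompt, explicit instruction, in-context examples), the templates below are a plausible reconstruction for illustration only; the exact wording used by Yin et al. (2023) is not reproduced here.

```python
# Illustrative prompt templates for the three input formats; the wording
# is a plausible reconstruction, not the paper's exact text.
DIRECT = "Q: {question}\nA:"

INSTRUCTION = (
    "If the question cannot be answered, reply exactly with 'The answer is unknown.'\n"
    "Q: {question}\nA:"
)

IN_CONTEXT = (
    "Q: What is the capital of France?\nA: Paris.\n"
    "Q: What will the fastest form of transportation be in 2050?\nA: The answer is unknown.\n"
    "Q: {question}\nA:"
)

prompt = INSTRUCTION.format(question="How come god was born from nothingness?")
```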
6. Implications and Limitations
SelfAware systematically exposes persistent gaps in LLM epistemic calibration: even state-of-the-art models, despite measurable self-knowledge, lag behind human baselines in distinguishing genuinely unanswerable questions from answerable ones. The utility of explicit instructions and in-context learning for boosting self-awareness suggests that training regime strongly affects epistemic boundary behaviors. A plausible implication is that ongoing advances in reasoning frameworks (e.g., Reflexion, Tree-of-Thought) and uncertainty quantification could further improve self-knowledge if integrated natively into model optimization.
A central limitation is reliance on a fixed, small pool of reference uncertainty utterances for detection; this may restrict generalization to new model output styles or domains. Future enhancements include automating the expansion of the uncertain-phrase set, and extending the methodology to non-textual (multimodal) and more open-ended epistemic settings (Yin et al., 2023).
7. Relation to Adjacent Benchmarks and Future Directions
The SelfAware dataset addresses a facet of model self-knowledge complementary to the situational awareness and autonomy capabilities tested by benchmarks such as the Situational Awareness Dataset (SAD) (Laine et al., 2024) and MM-SAP (Wang et al., 2024). While SAD interrogates models about their nature, operational context, and task-conditional behaviors, and MM-SAP extends self-awareness paradigms to vision-language settings, SelfAware isolates the boundary condition: knowing what you do not know in open-domain question answering. The dataset is released under CC-BY-SA-4.0 for unrestricted research use, with explicit encouragement for its adaptation and extension to close the gap between human-level and model-level metacognitive calibration (Yin et al., 2023).