QuALITY: Long-Document QA Benchmark
- QuALITY is a benchmark for QA systems that evaluates comprehension over long documents with average context lengths around 5,159 tokens.
- The dataset employs a rigorous multiple-choice format and a unique speed validation process to create HARD questions that resist simple skimming.
- Baseline transformer models show substantial performance gaps compared to human annotators, underscoring the challenge of long-context understanding.
QuALITY is a benchmark for question answering (QA) systems that evaluates comprehension of long documents, introducing uniquely rigorous challenges for machine reading due to both context length and the nature of the questions. The dataset comprises multiple-choice questions written and validated by annotators who have read full context passages averaging 5,159 word-level tokens—lengths that exceed the processing capabilities of most existing transformer models. Approximately half of the questions are constructed to be unanswerable under tight time constraints, ensuring that simple retrieval- or skimming-based strategies are insufficient. On this task, baseline transformer systems exhibit a substantial performance gap relative to human annotators, establishing QuALITY as a demanding benchmark for research into long-context comprehension (Pang et al., 2021).
1. Task Specification and Formalism
The QuALITY task involves multiple-choice QA over extended context passages. Formally, let $c$ be a context (long passage), $q$ the question, and $\{a_1, a_2, a_3, a_4\}$ the multiple-choice options. A model must compute probabilities $p(a_i \mid c, q)$ for each option and select
$$\hat{a} = \arg\max_i \, p(a_i \mid c, q).$$
The loss function is standard cross-entropy over the gold index $y$:
$$\mathcal{L} = -\log p(a_y \mid c, q).$$
Evaluation is accuracy, computed as the fraction of correctly answered questions in a test set of size $N$:
$$\mathrm{Acc} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}\!\left[\hat{a}^{(j)} = a_{y^{(j)}}^{(j)}\right].$$
Other metrics such as F1 are not used due to the multiple-choice format.
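The selection rule, loss, and accuracy metric above can be sketched in a few lines of plain Python (function names are illustrative, not from the paper; option log-probabilities are assumed to come from some scoring model):

```python
import math

def select_answer(option_logprobs):
    """Pick the option index with the highest model log-probability (the argmax rule)."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])

def cross_entropy_loss(option_logits, gold_index):
    """Standard cross-entropy over the answer options, computed from raw logits."""
    m = max(option_logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in option_logits))
    return log_z - option_logits[gold_index]

def accuracy(predictions, gold_labels):
    """Fraction of questions answered correctly."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```

With uniform logits over four options, the loss reduces to $\log 4 \approx 1.386$, the entropy of a random guess.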
2. Dataset Composition and Collection Protocol
QuALITY draws CC-BY licensed texts from diverse domains: Project Gutenberg science fiction (1950s–1970s), Slate magazine (Open American National Corpus), and miscellaneous nonfiction sources. Passage lengths range from approximately 2,000 words to just under 8,000 (maximum 7,759; mean 5,159).
The dataset includes 285 articles with 6,737 validated QA pairs, after filtering an original set of 7,620 (88.4% retention). Data is partitioned into train (150 articles, 2,523 Qs, 49.5% HARD), development (115 articles, 2,086 Qs, 51.1% HARD), and test (116 articles, 2,128 Qs, 49.1% HARD) splits, specifically to minimize stylistic overlap among question writers.
The question writing and validation pipeline:
- Authorship: Two trained writers per passage each author 10 questions, each with 4 answer options.
- Speed Validation: Five crowdworkers attempt each question with only 45 seconds to skim the context. If at least 3/5 annotators answer incorrectly, the question is labeled HARD.
- Untimed Validation: Three to five annotators, with unlimited time, answer and rate the question for answerability and ambiguity.
- Feedback and Incentives: Writers receive detailed feedback and financial bonuses for high-quality, speed-hard questions.
- Inclusion Criteria: Only questions with a clear majority gold label and positive answerability ratings from untimed annotators are retained.
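The speed-validation and inclusion rules above can be stated precisely as predicates. A minimal sketch follows; the function names are illustrative, and the exact aggregation of answerability ratings (here, requiring all positive) is an assumption, not the paper's published rule:

```python
def is_hard(timed_answers, gold_index, threshold=3):
    """Label a question HARD if at least `threshold` of the five
    speed-validation annotators (45-second skim) answered incorrectly."""
    wrong = sum(a != gold_index for a in timed_answers)
    return wrong >= threshold

def keep_question(untimed_answers, gold_index, answerable_votes):
    """Inclusion criterion: a clear majority of untimed annotators agree
    with the gold label, and answerability ratings are positive."""
    agree = sum(a == gold_index for a in untimed_answers)
    return agree > len(untimed_answers) / 2 and all(answerable_votes)
```

For example, a question on which three of five timed annotators pick a wrong option is HARD, while one that four of five answer correctly is EASY.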
Estimated human performance, based on a held-out sample (367 Qs, 3 new annotators per question): overall accuracy is 93.5% (EASY subset: 97.0%, HARD subset: 89.1%).
3. Structural and Reasoning Properties of Questions
Questions in QuALITY were crafted with attention to full-document semantics, rendering lexical overlap heuristics ineffective (highest token overlap achieves only 26.6% accuracy).
Surface-form distribution:
- what…? (42.2%)
- why…? (24.6%)
- how…? (11.9%)
- which…? (7.4%)
- who…? (4.2%)
- other (how-many, yes/no, where, when, etc.)
Reasoning type analysis (500 sampled Qs, multiple labels possible):
| Reasoning Type | % of Qs |
|---|---|
| Description | 33.2 |
| Why/Reason | 31.2 |
| Symbolism/Interpretation | 27.8 |
| How/Method | 8.8 |
| Event | 7.0 |
Questions labeled HARD (49.9%; 3,360 Qs) most often defeat skimming and local retrieval strategies; annotator accuracy is approximately 48.2% on HARD instances (improving from 39.5% to 58.4% over collection rounds).
4. Baseline Architectures and Experimental Protocols
Long-context encoders (no retrieval):
- Longformer-base (window: 4,096 tokens)
- LED-large (window: 16,384 tokens)
Retrieval + short-context reading pipelines:
- Each sentence of the context $c$ is scored against the question $q$ using:
  - ROUGE-1 recall with respect to the question
- fastText-cosine similarity (word-average)
- DPR (dense passage retriever)
- The top-scoring sentences (up to ~300 words total) are concatenated and input to short-context models.
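The ROUGE-1-recall variant of this pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenization stands in for real tokenization, the ~300-word budget is enforced greedily, and restoring document order for the selected sentences is a design choice, not necessarily the paper's exact procedure:

```python
def rouge1_recall(sentence, question):
    """Unigram recall: fraction of the question's tokens that appear in the sentence."""
    q_tokens = question.lower().split()
    s_tokens = set(sentence.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in s_tokens for t in q_tokens) / len(q_tokens)

def retrieve(context_sentences, question, word_budget=300):
    """Rank sentences by ROUGE-1 recall against the question, keep the
    top-scoring ones until the word budget is exhausted, then join the
    kept sentences in their original document order."""
    ranked = sorted(range(len(context_sentences)),
                    key=lambda i: rouge1_recall(context_sentences[i], question),
                    reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(context_sentences[i].split())
        if used + n > word_budget:
            break
        chosen.append(i)
        used += n
    return " ".join(context_sentences[i] for i in sorted(chosen))
```

The fastText and DPR variants differ only in the scoring function: average-word-vector cosine similarity and a dense dual-encoder score, respectively.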
Short-context reader models (max input: 512 tokens):
- RoBERTa-base/large
- DeBERTaV3-base/large
- T5-base
Training regimes:
- QU: Fine-tune on QuALITY (20 epochs)
- RA→QU: Fine-tune on RACE (3 epochs) then QuALITY (20 epochs)
- Zero-shot RACE: Fine-tune on RACE only
Optimization: default learning rates with 10% warmup and batch sizes of 8–16; maximum context length is set by model and hardware. T5-base is trained for 40k steps.
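For the 512-token readers, the retrieved passage, question, and each candidate answer are packed into one input sequence per option, with the context truncated to fit. A simplified word-level sketch (the `[SEP]` separator convention and word-level truncation are assumptions standing in for the models' actual subword tokenizers):

```python
def pack_inputs(passage, question, options, max_tokens=512):
    """Build one truncated input string per answer option,
    RACE-style: context + question + candidate answer.
    The reader scores each string and the argmax option wins."""
    inputs = []
    for opt in options:
        tail = f" [SEP] {question} [SEP] {opt}"
        budget = max_tokens - len(tail.split())  # words left for the context
        ctx = " ".join(passage.split()[:max(budget, 0)])
        inputs.append(ctx + tail)
    return inputs
```

Because truncation happens per option, a 5,000-word QuALITY passage loses most of its content at this stage, which is precisely why the retrieval step matters.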
5. Empirical Results and Error Analysis
| Model (train → test) | Test Acc. (Full) | Test Acc. (HARD) |
|---|---|---|
| Human (3 annotators) | 93.5% | 89.1% |
| DeBERTaV3-large (RACE→QU, DPR) | 55.4% | 46.1% |
| RoBERTa-large (RACE→QU, DPR) | 51.4% | 44.7% |
| DeBERTaV3-large (QU only, DPR) | 49.0% | 41.2% |
| RoBERTa-large (QU only, DPR) | 42.7% | 35.7% |
| Longformer-base (no retrieval) | 30.7% | 29.3% |
| LED-large (no retrieval) | 24.2% | 24.5% |
| Question-only (DeBERTaV3) | 43.3% | 38.2% |
Model–human gaps are 38.1 percentage points on the full set and 43.0 points on HARD questions. DPR-based retrieval outperforms ROUGE-1 and fastText by 3–5 points. An "oracle answer" setup—using the gold answer to guide IR—yields only ~78% accuracy, showing that no retrieval heuristic alone suffices.
Qualitative analysis identifies persistent failures in:
- Multi-sentence inference
- Discourse-level coreference
- Long-range causal chains
- Symbolic and interpretive questions

Models are particularly deficient on "why" and "interpretation" items that resist direct extraction.
6. Contributions, Challenges, and Future Research
QuALITY advances the field by introducing:
- A high-quality, large-scale, human‐validated multiple-choice QA dataset with unusually long (~5k token) contexts.
- The speed validation protocol (45s skimming) that adversarially separates questions answerable via skimming from those requiring comprehension.
- Comprehensive baselines across long‐context transformers, retrieval+reader pipelines, and transfer regimes.
- Systematic analysis of question/reasoning types and annotated human–machine performance disparities.
Identified challenges:
- Transformer models cannot reliably encode and exploit information spread across entire multi-thousand-token contexts.
- Both retrieval/truncation and pure long-context encoding approaches lose global discourse structure or key evidence.
- Pretrained models lack inductive biases for specific long-range or interpretive reasoning phenomena prevalent in QuALITY.
Proposed research directions:
- Development of architectures and pretraining procedures for >8k token context windows (e.g., BigBird, ETC, hierarchical memory).
- Elaboration of supervised intermediate tasks to foster long-context reasoning ability.
- Refined, answer-aware and iterative retrieval mechanisms with windowed rereading strategies.
- Incorporation of explicit discourse- or graph-based modules for inference across multi-sentence spans.
- Specialized data augmentation and adversarial question generation tailored to the characteristics of long contexts.
QuALITY thereby establishes a rigorous and multifaceted challenge at the frontier of long-document QA—requiring advances in both scalable model architectures and deep understanding strategies capable of holistic, document-wide reasoning over naturalistic, human-generated questions (Pang et al., 2021).