QuALITY: Long-Document QA Benchmark
- QuALITY is a benchmark for QA systems that evaluates comprehension over long documents with average context lengths around 5,159 tokens.
- The dataset employs a rigorous multiple-choice format and a unique speed validation process to create HARD questions that resist simple skimming.
- Baseline transformer models show substantial performance gaps compared to human annotators, underscoring the challenge of long-context understanding.
QuALITY is a benchmark for question answering (QA) systems that evaluates comprehension of long documents, introducing uniquely rigorous challenges for machine reading due to both context length and the nature of the questions. The dataset comprises multiple-choice questions written and validated by annotators who have read full context passages averaging 5,159 word-level tokens—lengths that exceed the processing capabilities of most existing transformer models. Approximately half of the questions are constructed to be unanswerable under tight time constraints, ensuring that simple retrieval- or skimming-based strategies are insufficient. On this task, baseline transformer systems exhibit a substantial performance gap relative to human annotators, establishing QuALITY as a demanding benchmark for research into long-context comprehension (Pang et al., 2021).
1. Task Specification and Formalism
The QuALITY task involves multiple-choice QA over extended context passages. Formally, let $c$ be a context (long passage), $q$ the question, and $\{a_1, a_2, a_3, a_4\}$ the multiple-choice options. A model must compute probabilities $p(a_i \mid c, q)$ for each option and select
$$\hat{a} = \arg\max_i \, p(a_i \mid c, q).$$
The loss function is standard cross-entropy over the gold index $y$:
$$\mathcal{L} = -\log p(a_y \mid c, q).$$
Evaluation is accuracy, computed as the fraction of correctly answered questions in a test set of size $N$:
$$\mathrm{Acc} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}\!\left[\hat{a}^{(j)} = a_{y^{(j)}}^{(j)}\right].$$
Other metrics such as F1 are not used due to the multiple-choice format.
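The selection rule, loss, and accuracy metric above can be sketched in a few lines of plain Python (function names are illustrative, not from the paper; option log-probabilities are assumed to come from some scoring model):

```python
import math

def select_answer(option_logprobs):
    """Pick the option index with the highest model log-probability (the argmax rule)."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])

def cross_entropy_loss(option_logits, gold_index):
    """Standard cross-entropy over the answer options, computed from raw logits."""
    m = max(option_logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in option_logits))
    return log_z - option_logits[gold_index]

def accuracy(predictions, gold_labels):
    """Fraction of questions answered correctly."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```

With uniform logits over four options, the loss reduces to $\log 4 \approx 1.386$, the entropy of a random guess.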
2. Dataset Composition and Collection Protocol
QuALITY draws CC-BY licensed texts from diverse domains: Project Gutenberg science fiction (1950s–1970s), Slate magazine (Open American National Corpus), and miscellaneous nonfiction sources. Passage lengths range from approximately 2,000 words to just under 8,000 (maximum 7,759; mean 5,159).
The dataset includes 285 articles with 6,737 validated QA pairs, after filtering an original set of 7,620 (88.4% retention). Data is partitioned into train (150 articles, 2,523 Qs, 49.5% HARD), development (115 articles, 2,086 Qs, 51.1% HARD), and test (116 articles, 2,128 Qs, 49.1% HARD) splits, specifically to minimize stylistic overlap among question writers.
The question writing and validation pipeline:
- Authorship: Two trained writers per passage each author 10 questions, each with 4 answer options.
- Speed Validation: Five crowdworkers attempt each question with only 45 seconds to skim the context. If at least 3/5 annotators answer incorrectly, the question is labeled HARD.
- Untimed Validation: Three to five annotators, with unlimited time, answer and rate the question for answerability and ambiguity.
- Feedback and Incentives: Writers receive detailed feedback and financial bonuses for high-quality, speed-hard questions.
- Inclusion Criteria: Only questions with a clear majority gold label and positive answerability ratings from untimed annotators are retained.
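The speed-validation and inclusion rules above can be stated precisely as predicates. A minimal sketch follows; the function names are illustrative, and the exact aggregation of answerability ratings (here, requiring all positive) is an assumption, not the paper's published rule:

```python
def is_hard(timed_answers, gold_index, threshold=3):
    """Label a question HARD if at least `threshold` of the five
    speed-validation annotators (45-second skim) answered incorrectly."""
    wrong = sum(a != gold_index for a in timed_answers)
    return wrong >= threshold

def keep_question(untimed_answers, gold_index, answerable_votes):
    """Inclusion criterion: a clear majority of untimed annotators agree
    with the gold label, and answerability ratings are positive."""
    agree = sum(a == gold_index for a in untimed_answers)
    return agree > len(untimed_answers) / 2 and all(answerable_votes)
```

For example, a question on which three of five timed annotators pick a wrong option is HARD, while one that four of five answer correctly is EASY.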
Estimated human performance, based on a held-out sample (367 Qs, 3 new annotators per question): overall accuracy is 93.5% (EASY subset: 97.0%, HARD subset: 89.1%).
3. Structural and Reasoning Properties of Questions
Questions in QuALITY were crafted with attention to full-document semantics, rendering lexical overlap heuristics ineffective (highest token overlap achieves only 26.6% accuracy).
Surface-form distribution:
- what…? (42.2%)
- why…? (24.6%)
- how…? (11.9%)
- which…? (7.4%)
- who…? (4.2%)
- other (how-many, yes/no, where, when, etc.)
Reasoning type analysis (500 sampled Qs, multiple labels possible):
| Reasoning Type | % of Qs |
|---|---|
| Description | 33.2 |
| Why/Reason | 31.2 |
| Symbolism/Interpretation | 27.8 |
| How/Method | 8.8 |
| Event | 7.0 |
Questions labeled HARD (49.9%; 3,360 Qs) most often defeat skimming and local retrieval strategies; annotator accuracy is approximately 48.2% on HARD instances (improving from 39.5% to 58.4% over collection rounds).
4. Baseline Architectures and Experimental Protocols
Long-context encoders (no retrieval):
- Longformer-base (window: 4,096 tokens)
- LED-large (window: 16,384 tokens)
Retrieval + short-context reading pipelines:
- Each sentence of the context $c$ is scored against the question $q$ using:
  - ROUGE-1 recall with respect to the question
- fastText-cosine similarity (word-average)
- DPR (dense passage retriever)
- The top-scoring sentences (up to ~300 words total) are concatenated and input to short-context models.
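The ROUGE-1-recall variant of this pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenization stands in for real tokenization, the ~300-word budget is enforced greedily, and restoring document order for the selected sentences is a design choice, not necessarily the paper's exact procedure:

```python
def rouge1_recall(sentence, question):
    """Unigram recall: fraction of the question's tokens that appear in the sentence."""
    q_tokens = question.lower().split()
    s_tokens = set(sentence.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in s_tokens for t in q_tokens) / len(q_tokens)

def retrieve(context_sentences, question, word_budget=300):
    """Rank sentences by ROUGE-1 recall against the question, keep the
    top-scoring ones until the word budget is exhausted, then join the
    kept sentences in their original document order."""
    ranked = sorted(range(len(context_sentences)),
                    key=lambda i: rouge1_recall(context_sentences[i], question),
                    reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(context_sentences[i].split())
        if used + n > word_budget:
            break
        chosen.append(i)
        used += n
    return " ".join(context_sentences[i] for i in sorted(chosen))
```

The fastText and DPR variants differ only in the scoring function: average-word-vector cosine similarity and a dense dual-encoder score, respectively.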
Short-context reader models (max input: 512 tokens):
- RoBERTa-base/large
- DeBERTaV3-base/large
- T5-base
Training regimes:
- QU: Fine-tune on QuALITY (20 epochs)
- RA→QU: Fine-tune on RACE (3 epochs) then QuALITY (20 epochs)
- Zero-shot RACE: Fine-tune on RACE only
Optimization: default learning rates with 10% warmup and batch sizes of 8–16; maximum context length is set by model and hardware. T5-base is trained for 40k steps.
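For the 512-token readers, the retrieved passage, question, and each candidate answer are packed into one input sequence per option, with the context truncated to fit. A simplified word-level sketch (the `[SEP]` separator convention and word-level truncation are assumptions standing in for the models' actual subword tokenizers):

```python
def pack_inputs(passage, question, options, max_tokens=512):
    """Build one truncated input string per answer option,
    RACE-style: context + question + candidate answer.
    The reader scores each string and the argmax option wins."""
    inputs = []
    for opt in options:
        tail = f" [SEP] {question} [SEP] {opt}"
        budget = max_tokens - len(tail.split())  # words left for the context
        ctx = " ".join(passage.split()[:max(budget, 0)])
        inputs.append(ctx + tail)
    return inputs
```

Because truncation happens per option, a 5,000-word QuALITY passage loses most of its content at this stage, which is precisely why the retrieval step matters.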
5. Empirical Results and Error Analysis
| Model (train → test) | Test Acc. (Full) | Test Acc. (HARD) |
|---|---|---|
| Human (3 annotators) | 93.5% | 89.1% |
| DeBERTaV3-large (RACE→QU, DPR) | 55.4% | 46.1% |
| RoBERTa-large (RACE→QU, DPR) | 51.4% | 44.7% |
| DeBERTaV3-large (QU only, DPR) | 49.0% | 41.2% |
| RoBERTa-large (QU only, DPR) | 42.7% | 35.7% |
| Longformer-base (no retrieval) | 30.7% | 29.3% |
| LED-large (no retrieval) | 24.2% | 24.5% |
| Question-only (DeBERTaV3) | 43.3% | 38.2% |
Model–human gaps are 38.1 percentage points on the full set and 43.0 points on HARD questions. DPR-based retrieval outperforms ROUGE-1 and fastText by 3–5 points. An "oracle answer" setup—using the gold answer to guide IR—yields only ~78% accuracy, showing that no retrieval heuristic alone suffices.
Qualitative analysis identifies persistent failures in:
- Multi-sentence inference
- Discourse-level coreference
- Long-range causal chains
- Symbolic and interpretive questions

Models are particularly deficient on "why" and "interpretation" items that resist direct extraction.
6. Contributions, Challenges, and Future Research
QuALITY advances the field by introducing:
- A high-quality, large-scale, human‐validated multiple-choice QA dataset with unusually long (~5k token) contexts.
- The speed validation protocol (45s skimming) that adversarially separates questions answerable via skimming from those requiring comprehension.
- Comprehensive baselines across long‐context transformers, retrieval+reader pipelines, and transfer regimes.
- Systematic analysis of question/reasoning types and annotated human–machine performance disparities.
Identified challenges:
- Transformer models cannot reliably encode and exploit information spread across entire multi-thousand-token contexts.
- Both retrieval/truncation and pure long-context encoding approaches lose global discourse structure or key evidence.
- Pretrained models lack inductive biases for specific long-range or interpretive reasoning phenomena prevalent in QuALITY.
Proposed research directions:
- Development of architectures and pretraining procedures for >8k token context windows (e.g., BigBird, ETC, hierarchical memory).
- Elaboration of supervised intermediate tasks to foster long-context reasoning ability.
- Refined, answer-aware and iterative retrieval mechanisms with windowed rereading strategies.
- Incorporation of explicit discourse- or graph-based modules for inference across multi-sentence spans.
- Specialized data augmentation and adversarial question generation tailored to the characteristics of long contexts.
QuALITY thereby establishes a rigorous and multifaceted challenge at the frontier of long-document QA—requiring advances in both scalable model architectures and deep understanding strategies capable of holistic, document-wide reasoning over naturalistic, human-generated questions (Pang et al., 2021).