DROP Benchmark: Discrete Reasoning

Updated 24 March 2026

The DROP benchmark is a high-complexity reading comprehension task that evaluates systems on discrete operations such as arithmetic, counting, and sorting over Wikipedia texts.
It features adversarially authored, compositional questions that reveal significant performance gaps between state-of-the-art models and human accuracy.
The benchmark employs rigorous annotation protocols and diverse evaluation metrics to drive advances in semantic parsing, coreference resolution, and neural-symbolic reasoning.

The Discrete Reasoning Over Paragraphs (DROP) benchmark is a high-complexity English reading comprehension task designed to test a system's ability to resolve references and execute discrete reasoning operations (such as addition, counting, sorting, and comparison) over Wikipedia paragraphs. Its questions are compositional and adversarially authored, so answering them requires extracting and aggregating information scattered across a lengthy passage. The systems evaluated in the original paper fall well short of human-level accuracy, highlighting open research questions at the intersection of reading comprehension, semantic parsing, and neural-symbolic reasoning (Dua et al., 2019).

1. Benchmark Construction and Dataset Statistics

The DROP dataset was constructed via adversarial, crowdsourced annotation, aiming to elicit questions demanding discrete operations and compositional reasoning. Annotators wrote question–answer pairs about passages sampled from Wikipedia (emphasizing sports game summaries, historical events, and entries rich in numerics). To ensure question complexity, annotators were constrained to write questions that a BiDAF QA system running in the background could not answer correctly.

The dataset statistics are as follows:

Split	# Passages	# Questions	Avg. Passage Length (words)	Avg. Questions/Passage
Train	5,565	77,409	≈ 200	14–16
Development	582	9,536	≈ 200	14–16
Test	588	9,622	≈ 200	14–16
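As a quick sanity check, the questions-per-passage ratio can be recomputed from the counts in the table. The snippet below is only an illustrative sketch; the variable names are not from the source. It gives roughly 13.9 for the training split and 16.4 for the development and test splits, which is broadly in line with the 14–16 range shown.

```python
# (passages, questions) per split, copied from the table above.
splits = {
    "train": (5565, 77409),
    "dev": (582, 9536),
    "test": (588, 9622),
}

# Average number of questions per passage, rounded to one decimal place.
ratios = {name: round(q / p, 1) for name, (p, q) in splits.items()}
print(ratios)
```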

The question vocabulary in the training set spans approximately 30,000 tokens. Annotators were required to write a minimum of 12 question–answer pairs per HIT in 30 minutes, with answer types restricted to (a) contiguous spans from the passage or question, (b) integers (explicit units requested), and (c) dates.

Manual annotation of 350 questions sourced from the dataset yields the following approximate distribution of discrete reasoning operations:

Reasoning Type	Proportion (%)
Subtraction	28.8
Selection (set membership)	19.4
Comparison (min/max)	18.2
Counting	16.5
Addition	11.7
Sorting + argmax/argmin	11.7
Sets of spans	6.0
Coreference + arithmetic	3.7
Other (percentages, multiplication)	3.2
Miscellaneous linguistic	6.8

2. Task Definition and Formalization

Formally, each DROP instance consists of a passage $P = (p_1, \ldots, p_n)$ and a question $Q = (q_1, \ldots, q_m)$. The expected system output is an answer $a \in A$, where $A$ is the union of four answer types: (i) contiguous spans of $P$; (ii) contiguous spans of $Q$; (iii) integers; (iv) dates.
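One way to picture this answer space is as a small tagged union. The sketch below is an illustrative assumption, not the dataset's actual file schema: the class names `Span` and `DropInstance`, the token-index convention, and the `resolve` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Union


@dataclass(frozen=True)
class Span:
    """A contiguous token span in either the passage or the question."""
    source: str  # "passage" or "question" (hypothetical convention)
    start: int   # inclusive token index
    end: int     # exclusive token index


# The answer space A: a passage/question span, an integer, or a date.
Answer = Union[Span, int, date]


@dataclass
class DropInstance:
    passage: List[str]   # tokens p_1 .. p_n
    question: List[str]  # tokens q_1 .. q_m

    def resolve(self, answer: Answer) -> str:
        """Render any answer type as a string for comparison or display."""
        if isinstance(answer, Span):
            tokens = self.passage if answer.source == "passage" else self.question
            return " ".join(tokens[answer.start:answer.end])
        if isinstance(answer, date):
            return answer.isoformat()
        return str(answer)


# Made-up example instance, not taken from the dataset.
inst = DropInstance(
    passage="The Bears scored 21 points .".split(),
    question="How many points ?".split(),
)
print(inst.resolve(Span("passage", 1, 2)))
print(inst.resolve(7))
```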

Many questions refer to explicit spans in $P$ and require subsequent aggregation. Typical system behavior includes soft-matching references, extracting values or sets, and conducting arithmetic or comparative operations to derive final answers.

These operations challenge systems to move beyond simple pattern matching or shallow inference: instead, significant reference resolution, event chaining, and numerical computation are essential.
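To make the extract-then-aggregate pattern concrete, here is a toy sketch over a made-up passage. The passage text, the regular expression, and the variable names are all invented for illustration; real DROP passages are far less regular, which is exactly why pattern matching alone falls short.

```python
import re

# Hypothetical passage in the style of a sports game summary.
passage = (
    "In the first quarter, Smith kicked a 23-yard field goal. "
    "Jones added a 41-yard field goal before halftime, "
    "and Smith hit a 35-yard field goal in the fourth quarter."
)

# Step 1: extract (kicker, yards) pairs from the text.
pattern = re.compile(r"(\w+) (?:kicked|added|hit) a (\d+)-yard field goal")
kicks = [(name, int(yards)) for name, yards in pattern.findall(passage)]

# Step 2: apply discrete operations to the extracted values.
count = len(kicks)                                         # counting
longest_kicker = max(kicks, key=lambda k: k[1])[0]         # sorting + argmax
yard_gap = max(y for _, y in kicks) - min(y for _, y in kicks)  # subtraction
smith_total = sum(y for name, y in kicks if name == "Smith")     # selection + addition

print(count, longest_kicker, yard_gap, smith_total)
```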

3. Evaluation Metrics

DROP employs evaluation protocols refined for compositional reading comprehension and numeracy:

Exact Match (EM) / Generalized Accuracy: A normalized string comparison between predicted and gold answer strings, marking success only upon exact correspondence.
Numeracy-focused $F_1$: Analogous to SQuAD's token-level $F_1$, but with $F_1 = 0$ whenever any numeric token mismatches. For single-span answers,
- $\mathrm{precision} = \frac{\# \text{overlapping tokens}}{\# \text{tokens in prediction}}$
- $\mathrm{recall} = \frac{\# \text{overlapping tokens}}{\# \text{tokens in gold}}$
- $F_1 = 2\,\times\, \frac{\mathrm{precision}\,\times\,\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$

For example, if the gold answer is the span “4.3 million” (that is, 4,300,000) and the prediction is “4.3 million,” both tokens overlap, so precision and recall are both 1 and $F_1 = 1.0$.
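The single-span formulas above can be sketched in a few lines of Python. This is not the official DROP evaluation script. The tokenization, the article stripping, and in particular the reading of "numeric mismatch" as "the sets of numeric tokens differ" are simplifying assumptions, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

ARTICLES = {"a", "an", "the"}
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def tokens(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop articles."""
    out = []
    for tok in text.lower().split():
        tok = tok.strip(string.punctuation)
        if tok and tok not in ARTICLES:
            out.append(tok)
    return out


def is_number(tok):
    return NUMBER.fullmatch(tok) is not None


def numeracy_f1(prediction, gold):
    pred, ref = tokens(prediction), tokens(gold)
    # Assumption: any difference between the numeric tokens zeroes the score.
    if {t for t in pred if is_number(t)} != {t for t in ref if is_number(t)}:
        return 0.0
    # Bag-of-words overlap, counting repeated tokens at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(numeracy_f1("4.3 million", "4.3 million"))  # 1.0
print(numeracy_f1("4.5 million", "4.3 million"))  # 0.0
```

Under these assumptions, a prediction of "the Chicago Bears" against a gold answer of "Bears" has precision 0.5 and recall 1, giving an $F_1$ of 2/3.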

4. Baseline Systems and Empirical Results

DROP’s question design renders simple heuristic and SQuAD-style extractive models ineffective. The following summarizes baseline and model performance:

Model/Baseline	Dev F₁	Test F₁
Majority (top-3 by q-word)	1.4	—
Question-only	8.1	—
Passage-only	2.3	—
Semantic parser (SRL/OpenIE)	∼11	—
BiDAF	28.9	27.5
QANet	30.4	28.4
QANet+ELMo	30.3	29.7
BERT	33.4	32.7
Human (expert)	—	96.4

These results confirm that DROP exceeds the complexity of prior RC tasks. State-of-the-art reading comprehension approaches plateau below 35% F₁ (test), far from human reliability, especially on questions involving multiple database-like operations, coreference, and arithmetic (Dua et al., 2019).

5. The NAQANet Model: Integrating Numeric Reasoning with RC

NAQANet is introduced as a specialized RC model embedding explicit modules for discrete numeric computation atop a QANet encoder backbone. The model decomposes final predictions through five “heads”:

Passage-span head: locates answer spans in $P$.
Question-span head: predicts answer spans in $Q$, conditioned on the passage encoding.
Count head: predicts a count by classifying over $\{0,\dots,9\}$.
Arithmetic head: extracts all numeric tokens from $P$ and assigns each a sign in $\{+1, -1, 0\}$, computing $\sum_i \text{sign}_i \times \text{value}_i$ to produce addition/subtraction answers.
Type head: predicts the overall answer type via a softmax over concatenated passage/question embeddings.
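The decoding step of the arithmetic and count heads can be illustrated without any neural machinery. The sketch below assumes the signs and the count scores have already been predicted. The function names and the example numbers are hypothetical and are not taken from the NAQANet implementation.

```python
def arithmetic_answer(values, signs):
    """Evaluate sum_i sign_i * value_i for signs in {+1, -1, 0}."""
    if len(values) != len(signs):
        raise ValueError("need exactly one sign per passage number")
    if any(s not in (1, -1, 0) for s in signs):
        raise ValueError("signs must be +1, -1, or 0")
    return sum(s * v for s, v in zip(signs, values))


def count_answer(scores):
    """Pick the highest-scoring class among the counts 0..9."""
    if len(scores) != 10:
        raise ValueError("expected one score per count in 0..9")
    return max(range(10), key=scores.__getitem__)


# Numbers found in a made-up passage: two scores and an unrelated quantity.
passage_numbers = [21, 14, 3]
# "How many more points ...?" -> keep 21, subtract 14, ignore 3.
print(arithmetic_answer(passage_numbers, [1, -1, 0]))  # 7

# Made-up count scores whose maximum sits at index 3.
print(count_answer([0.1, 0.1, 0.1, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))  # 3
```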

Training employs weak supervision: the model enumerates all latent executions (spans, sign assignments, counts) that would yield the correct answer and maximizes marginal likelihood over them. Performance improves over previous models:

Dev: EM = 46.2%, F₁ = 49.2%
Test: EM = 44.1%, F₁ = 47.0%
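The weak-supervision idea described in this section can be sketched for the arithmetic head alone: enumerate every sign assignment that reproduces the gold answer, then sum their probabilities. This is a simplified illustration under stated assumptions, not the paper's training code. It ignores spans and counts, treats the signs as independent across numbers, excludes the all-zero assignment, and uses hypothetical function names.

```python
import math
from itertools import product


def consistent_sign_assignments(values, gold):
    """All sign tuples over {+1, -1, 0} whose signed sum equals the gold answer."""
    return [
        signs
        for signs in product((1, -1, 0), repeat=len(values))
        # Assumption: skip the all-zero assignment, which selects no numbers.
        if any(signs) and sum(s * v for s, v in zip(signs, values)) == gold
    ]


def marginal_log_likelihood(sign_log_probs, assignments):
    """log of the summed probability of the given assignments.

    sign_log_probs[i][s] is the log-probability that number i receives sign s.
    """
    if not assignments:
        return float("-inf")
    totals = [
        sum(lp[s] for lp, s in zip(sign_log_probs, signs))
        for signs in assignments
    ]
    # Log-sum-exp for numerical stability.
    m = max(totals)
    return m + math.log(sum(math.exp(t - m) for t in totals))


values = [21, 14, 7]
valid = consistent_sign_assignments(values, gold=7)
print(valid)

# Uniform sign distribution for each number, purely for illustration.
uniform = [{s: math.log(1 / 3) for s in (1, -1, 0)} for _ in values]
print(marginal_log_likelihood(uniform, valid))
```

In this example three assignments are consistent with the gold answer 7 (21 − 14, 14 − 7, and 7 alone), so the model is rewarded for placing probability on any of them. That is also how spurious derivations can be reinforced.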

Error decomposition on 100 development examples indicates principal difficulties in arithmetic-operation prediction (51%) and counting (30%), with notable multi-step compositional errors (40%), domain knowledge gaps (23%), and coreference linking mistakes (6%).

6. Principal Challenges and Research Directions

The core research value of DROP is to expose the limitations of textual machine comprehension when discrete reasoning is required. Key unresolved challenges include:

Arithmetic and Multi-Span Reasoning: Existing RC models cannot perform robust discrete numeric or multi-span operations, a frequent demand in DROP.
Semantic Parsing Weaknesses: Grammar-constrained semantic parsers are limited by spurious logical forms and dependency on high-quality information extraction pipelines, with incomplete coverage (∼25–34%).
Complex Question Demands: Many questions necessitate multi-stage reasoning, reference resolution, chaining of events, and incorporation of world knowledge.

Error types encountered in modeling DROP include incorrect sign assignments, missing/extra mention counting, pronoun/event dereferencing failures, arithmetic off-by-one errors, and unfamiliar domain-specific terms.

Open research directions suggested by these failures encompass:

Neural-symbolic models dynamically merging span extraction and aggregation.
Joint modeling of coreference, event detection, and arithmetic modules.
Un/semi-supervised pre-training for discrete numerical tasks.
Expansion to broader logical operations, including multiplication, division, quantifiers, and structured logical forms.

A plausible implication is that bridging the performance gap to human-level comprehension will require the synthesis of advanced neural reading comprehension approaches with explicit symbolic computation capabilities and improved overall RC architectures (Dua et al., 2019).


References (1)

Dua et al. (2019). DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs.
