ARC Corpus: A Science Text Dataset
- ARC Corpus is a large-scale, science-focused dataset of approximately 14 million sentences targeting elementary and middle-school science topics.
- It integrates web-harvested and curated scientific content with metadata to support retrieval, entailment, and reading comprehension pipelines using BM25 and neural reranking techniques.
- Researchers use the corpus for QA, curriculum automation, and training domain-specific language models; it also underpins research on multi-hop reasoning and commonsense inference.
The ARC Corpus is a large-scale, science-focused text corpus released as part of the AI2 Reasoning Challenge (ARC), designed to support advanced question answering and reasoning tasks in elementary and middle-school science. Comprising approximately 14 million sentences sourced primarily from the web and augmented with curated scientific definitions and articles, the ARC Corpus is tightly integrated with the ARC benchmark's multiple-choice question sets and serves as the primary knowledge base for retrieval, entailment, and reading comprehension pipelines. The resource delivers near-complete coverage of grade-school science vocabulary, spans all core K–8 scientific disciplines, and is publicly available under a permissive academic research license (Clark et al., 2018).
1. Corpus Construction and Scope
The ARC Corpus comprises ≈14 million sentences (≈1.4 GB of plain text) selected to maximize relevance to 80 elementary and middle-school science topics, such as physics, biology, chemistry, earth science, and astronomy. The corpus was harvested using ≈100 hand-written web search templates instantiated over curated term lists (e.g., "[astronomical-term] astronomy") to generate focused queries (~720 for astronomy alone); a sketch of this instantiation step follows the source list below. Sources include:
- General web pages returned from commercial search engines, filtered for science relevance.
- “AristoMini” supplement with Wiktionary definitions, Simple Wikipedia science articles, and additional web-harvested scientific sentences.
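A minimal sketch of the template-instantiation step described above; the template strings and term list here are hypothetical stand-ins, since the ~100 actual templates used to build the corpus are not distributed:

```python
# Sketch of query generation from hand-written search templates.
# Terms and templates below are illustrative, not the ARC originals.

ASTRONOMY_TERMS = ["nebula", "red giant", "lunar eclipse"]
TEMPLATES = ["{term} astronomy", "what is a {term}", "{term} definition science"]

def instantiate_queries(terms, templates):
    """Cross curated terms with templates to produce focused web-search queries."""
    return [t.format(term=term) for term in terms for t in templates]

queries = instantiate_queries(ASTRONOMY_TERMS, TEMPLATES)
# e.g. "nebula astronomy", "what is a red giant", "lunar eclipse definition science"
```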
Deduplication is performed at both document and sentence level. Although there is no absolute sentence-length cutoff at build time, downstream QA systems often filter out sentences exceeding 300 characters. Informal estimates indicate ≈75% of included pages are genuinely science-focused. The corpus covers 99.8% of the ARC question vocabulary (≈6329 stemmed words), ensuring robust lexical support across all major curricular domains.
2. Data Format, Metadata, and Preprocessing
The ARC Corpus is distributed as a plain UTF-8 text file with one sentence per line. An optional index or manifest file (JSONL/TSV) maps each sentence back to:
- Source URL (web or AristoMini sub-source like Wiktionary/SimpleWiki).
- Document title or section/paragraph identifiers where available.
- Metadata fields per sentence: sentence_id (unique), text (raw string), provenance (URL or sub-source ID), source_type (web/AristoMini tag).
HTML markup is removed and de-duplication is applied once at build time. No lowercasing or tokenization is applied prior to distribution; these steps are left to the user, who may apply standard NLP preprocessing (tokenization, lowercasing, stopword removal) for retrieval tasks.
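As a concrete illustration, the sketch below loads the sentence file and an assumed JSONL manifest, applying the optional user-side preprocessing described above. The file names, the stopword list, and the manifest layout are assumptions (the field names follow the metadata list in this section, but the actual distribution layout may differ):

```python
import json
import re

CORPUS_PATH = "ARC_Corpus.txt"        # assumed name; one UTF-8 sentence per line
MANIFEST_PATH = "arc_manifest.jsonl"  # hypothetical per-sentence metadata file
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and"}  # toy list

def preprocess(text):
    """Optional user-side step: lowercase, tokenize, drop stopwords."""
    return [t for t in re.split(r"\W+", text.lower()) if t and t not in STOPWORDS]

def load_sentences(path, max_chars=300):
    """Yield sentences, applying the common downstream 300-character filter."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent = line.rstrip("\n")
            if sent and len(sent) <= max_chars:
                yield sent

def load_manifest(path):
    """Map sentence_id -> metadata record (text, provenance, source_type)."""
    with open(path, encoding="utf-8") as f:
        return {rec["sentence_id"]: rec for rec in map(json.loads, f)}
```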
3. Integration into Question Answering Pipelines
Typical QA pipelines index the corpus using a high-recall BM25-based IR engine (such as Elasticsearch or Lucene). For each (question, candidate answer) tuple, a combined query is constructed, top-K sentences are retrieved using BM25, and minimum overlap heuristics (non-stopword matches) are enforced to filter irrelevant results. The chosen answer candidate receives the score of its best supporting sentence.
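A minimal sketch of this retrieve-and-score loop, using the open-source rank_bm25 package as a stand-in for a production Elasticsearch/Lucene index; the value of K, the stopword list, and the overlap threshold are illustrative assumptions, not values from the ARC paper:

```python
from rank_bm25 import BM25Okapi

STOPWORDS = {"the", "a", "an", "of", "in", "is", "what", "which"}  # toy list

def content_words(text):
    """Lowercase and keep non-stopword tokens."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def score_candidate(question, candidate, bm25, sentences, k=10, min_overlap=1):
    """Score one answer candidate by its best supporting sentence."""
    query = content_words(question + " " + candidate)   # combined query
    scores = bm25.get_scores(query)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    best = 0.0
    for i in top:
        # Overlap heuristic: require non-stopword matches with the candidate.
        overlap = set(content_words(candidate)) & set(content_words(sentences[i]))
        if len(overlap) >= min_overlap:
            best = max(best, scores[i])
    return best

# Usage: index the corpus once, then score each (question, candidate) tuple.
sentences = ["Gravity pulls objects toward Earth.", "Plants need sunlight to grow."]
bm25 = BM25Okapi([content_words(s) for s in sentences])
print(score_candidate("What force pulls objects down?", "gravity", bm25, sentences))
```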
BM25 scoring is defined as:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where:
- $f(q_i, D)$: frequency of term $q_i$ in document $D$
- $\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$: inverse document frequency for term $q_i$
- $N$: total number of documents; $n(q_i)$: number of documents containing $q_i$
- defaults: $k_1 = 1.2$, $b = 0.75$; $|D|$: document length; $\text{avgdl}$: average document length
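For concreteness, a direct transcription of this formula into Python (a minimal self-contained scorer using the smoothed IDF above, not the Elasticsearch/Lucene implementation):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, docs, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula above.

    query_terms and doc_terms are token lists; docs is the full list of
    tokenized documents (needed for N, n(q), and avgdl).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in docs if q in d)               # n(q): docs containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # smoothed IDF
        f = tf[q]                                          # f(q, D)
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score
```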
Example QA pipelines use neural models for entailment reranking (e.g., DecompAttn, DGEM) or for reading comprehension via pseudo-paragraph concatenation and span extraction (e.g., BiDAF).
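As one modern stand-in for such entailment rerankers, the sketch below rescores retrieved sentences with an off-the-shelf NLI cross-encoder; the model name is illustrative and this is not the DecompAttn/DGEM setup from the original ARC baselines:

```python
from sentence_transformers import CrossEncoder

# Illustrative NLI model, not the ARC paper's rerankers. For this model
# family, predict() returns per-label logits in the order
# (contradiction, entailment, neutral).
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def rerank_by_entailment(retrieved, hypothesis):
    """Rerank retrieved sentences by how strongly they entail the hypothesis.

    The hypothesis is typically the question with the candidate answer
    substituted in. Returns (sentence, entailment_score) pairs, best first.
    """
    logits = nli.predict([(sent, hypothesis) for sent in retrieved])
    scored = [(s, float(l[1])) for s, l in zip(retrieved, logits)]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```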
4. Optimization, Usage, and Extensions
Best practice involves indexing the full corpus in an IR engine, querying with combined question/answer, enforcing relevancy heuristics, and reranking results with neural entailment or semantic graph inference methods (TableILP, TupleInference). For multi-hop reasoning, per-fact retrieval and chaining by sub-question or predicate is recommended. Sentence embeddings and dense retrieval methods (e.g., DPR) may be precomputed to accelerate lookup.
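A sketch of the dense-retrieval precomputation, using sentence-transformers as one possible bi-encoder; the model name is an illustrative choice (any DPR-style checkpoint works), not a recommendation from the ARC paper:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative encoder; swap in a DPR-style checkpoint if preferred.
model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(sentences):
    """Precompute unit-normalized embeddings so search is a dot product."""
    emb = model.encode(sentences, convert_to_numpy=True)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def dense_search(query, index, sentences, k=5):
    """Return the top-k corpus sentences by cosine similarity to the query."""
    q = model.encode([query], convert_to_numpy=True)[0]
    q = q / np.linalg.norm(q)
    top = np.argsort(index @ q)[::-1][:k]
    return [(sentences[i], float(index[i] @ q)) for i in top]
```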
The ARC Corpus is released for public, non-commercial use at http://data.allenai.org/arc. Recommended downstream applications include:
- Science factoid QA and curriculum question generation.
- Commonsense or multi-hop reasoning research, where the corpus serves as “middle-layer” knowledge.
- Fine-tuning domain-specific LLMs on science sentences.
5. Applications and Impact
The ARC Corpus has become foundational in benchmarking advances in open-domain and science-specific QA systems, driving research on complex reasoning and retrieval-augmented architectures. It is used both for direct retrieval-based baselines and for fueling entailment, comprehension, and answer verification modules on the ARC Challenge and Easy Sets. Its breadth enables robust semantic coverage for tasks ranging from curriculum automation to commonsense inference and LLM pretraining. The corpus remains the largest public-domain science sentence set tailored for machine reasoning at the K–8 level (Clark et al., 2018).