
Super-NaturalInstructions Benchmark

Updated 9 February 2026
  • Super-NaturalInstructions is a large-scale meta-dataset that uses natural language task definitions to assess cross-task generalization in NLP models.
  • It includes 1,616 tasks across multiple domains and languages, each with clear human-readable instructions and curated positive/negative examples.
  • The benchmark supports systematic evaluation and scaling analysis, advancing research in instruction-driven NLP model performance.

Super-NaturalInstructions (Sup-NatInst) is a large-scale meta-dataset and benchmark designed to rigorously evaluate cross-task generalization in NLP models via declarative task instructions. The benchmark encompasses 1,616 diverse tasks, each accompanied by expert-crafted, human-readable instructions, and supports systematic assessment of models’ ability to generalize to entirely unseen tasks when provided only with natural language task definitions and a small set of training exemplars. Its design emphasizes scale, task diversity, and declarative instruction format as foundational pillars for advancing general-purpose, instruction-driven NLP models (Wang et al., 2022).

1. Motivation and Conceptual Foundations

Sup-NatInst addresses the question of how well NLP models can generalize to a broad spectrum of unseen tasks when guided solely by textual instructions. Recognizing limitations in earlier benchmarks (typically spanning only a few dozen tasks), Sup-NatInst expands both the number and diversity of tasks (1,616 tasks across 76 types) and languages (55 languages in total, with 576 non-English tasks). Each task is defined by a natural language “Definition” and illustrated by positive and negative examples, ensuring the only information available to models at test time mirrors how humans interpret written instructions for unfamiliar tasks. This setup is intended to benchmark true cross-task generalization, where a model is trained to follow instructions for a subset of tasks and tested on its ability to extrapolate to held-out, unseen tasks (Wang et al., 2022).

The meta-dataset enables research into “learning to learn” from instructions, where model robustness and adaptability to new tasks are compared against narrow, task-specific pipelines. Declarative, human-readable instructions act as the mechanism for transferring task understanding, a core challenge for generalist NLP systems.

2. Dataset Structure and Statistics

Sup-NatInst’s 1,616 tasks are grouped into six broad categories:

  • Classification: Sentiment analysis, textual entailment
  • Span- or token-level extraction: Keyword tagging, overlap extraction
  • Text infilling: Masked question answering
  • Sequence tagging: Named-entity recognition
  • Text rewriting: Question rewriting, grammar correction
  • Text composition: Title generation, data-to-text, summarization

Each task follows a schematized instruction format:

| Component | Description | Example |
|-------------------|-----------------------------------------------|---------|
| Definition | Natural language mapping from input to output | “Given two sentences, output ‘1’ if the first entails the second…” |
| Positive Examples | Input, correct output, brief explanation | Input: “Sentence 1:…”. Output: 0 (contradict) |
| Negative Examples | Input, incorrect output, brief explanation | Input: “Sentence 1:…”. Output: 0 |

Tasks span 33 domains and 55 languages, with a mean of 3,106 instances per task (approximately 5 million total), and task instructions averaging 2.8 positive and 2.4 negative exemplars. For evaluation, 154 tasks (covering 15,310 instances) are held out; 1,462 tasks are available for model training and validation. There are separate tracks for English (119 test, 757 training tasks) and cross-lingual (35 test, 1,271 training tasks) benchmarking (Wang et al., 2022).
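A task instance following this schema might be serialized as JSON, which is also how community contributions are submitted (the field names below approximate the published format and are illustrative; consult the benchmark repository for the exact keys):

```python
import json

# Illustrative Sup-NatInst-style task file (field names approximate the
# published schema; the official repository defines the exact keys).
task = {
    "Definition": ["Given two sentences, output '1' if the first entails "
                   "the second, and '0' otherwise."],
    "Positive Examples": [
        {"input": "Sentence 1: A dog runs. Sentence 2: An animal moves.",
         "output": "1",
         "explanation": "Running implies moving, so entailment holds."}
    ],
    "Negative Examples": [
        {"input": "Sentence 1: A dog runs. Sentence 2: A cat sleeps.",
         "output": "1",
         "explanation": "The sentences are unrelated, so '1' is incorrect."}
    ],
    "Instances": [
        {"input": "Sentence 1: It rains. Sentence 2: The ground is wet.",
         "output": ["1"]}
    ],
}

serialized = json.dumps(task, indent=2)
```

Keeping tasks in a uniform machine-readable schema is what allows the automated validation and prompt construction described below to operate across all 1,616 tasks.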

3. Instruction Curation and Design Principles

Task instructions are produced via a collaborative process involving 88 NLP practitioners, utilizing:

  • Community-driven contributions: Tasks are submitted as JSON via GitHub.
  • Automated validation: Checks enforce format validity, instance balance, and elimination of duplicates.
  • Peer review: 1–2 subject experts ensure instructions are clear, complete, and concise.
  • Crowdsourced evaluation: For English tasks, Amazon Mechanical Turk workers flag ambiguities or typographical issues.

Guidelines emphasize self-contained, unambiguous definitions that properly illustrate correct and incorrect model behavior. Instructions aim for conciseness (mean 57 words per definition) and natural, fluent language. Exemplar selection prioritizes informativeness for generalization. Each instruction must be directly processable and interpretable as input to an NLP system (Wang et al., 2022).
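The automated validation stage can be sketched as a set of simple checks (an illustrative approximation; the benchmark's actual CI scripts enforce more extensive rules):

```python
def validate_task(task):
    """Return a list of problems found in a Sup-NatInst-style task dict.

    Illustrative checks only: definition present, at least one positive
    example, non-empty instances, and no duplicate instance inputs.
    """
    problems = []
    if not task.get("Definition"):
        problems.append("missing task definition")
    if not task.get("Positive Examples"):
        problems.append("no positive examples")
    instances = task.get("Instances", [])
    if not instances:
        problems.append("no instances")
    seen = set()
    for inst in instances:
        key = inst.get("input", "")
        if key in seen:
            problems.append(f"duplicate instance input: {key!r}")
        seen.add(key)
    return problems

report = validate_task({
    "Definition": "Classify sentiment.",
    "Positive Examples": [],
    "Instances": [{"input": "a"}, {"input": "a"}],
})
```

A task passing such checks would then proceed to peer review and, for English tasks, crowdsourced evaluation.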

4. Model Architecture and Training (Tk-Instruct)

Tk-Instruct is a text-to-text generative model based on the T5 architecture, instantiated in two primary variants:

  • Tk-Instruct (English): T5-11B (11 billion parameters)
  • mTk-Instruct (Multilingual): mT5-13B (13 billion parameters)

Model input encoding uses the concatenation:

```
Definition: {definition}
Positive Example 1: input:{...} output:{...}
...
Now complete the following example – input: {x} output:
```
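Assembling this encoding can be sketched in a few lines of Python (function name and exact formatting are assumptions for illustration; the released Tk-Instruct code may format details differently):

```python
def build_prompt(definition, pos_examples, x, k=2):
    """Concatenate a task definition, up to k positive demonstrations,
    and the query input into a single Tk-Instruct-style encoding."""
    parts = [f"Definition: {definition}"]
    for i, (inp, out) in enumerate(pos_examples[:k], start=1):
        parts.append(f"Positive Example {i}: input: {inp} output: {out}")
    parts.append(f"Now complete the following example - input: {x} output:")
    return "\n".join(parts)

prompt = build_prompt(
    "Given two sentences, output '1' if the first entails the second.",
    [("Sentence 1: A dog runs. Sentence 2: An animal moves.", "1")],
    "Sentence 1: It rains. Sentence 2: The ground is wet.",
)
```

The resulting string is fed to the encoder as-is; the model's generated continuation after the final `output:` is taken as its answer.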

The training objective is standard cross-entropy on token sequences:

\mathcal{L}(\theta) = -\sum_{t=1}^{N} \log p_\theta\left(y_t \mid y_{<t}, \text{enc}(I, x)\right)

with optimization via Adam (learning rate 10^{-5}, batch size approximately 1 million tokens). The T5-11B variant is trained for 1,000 steps on TPU v3-256 (approximately 4 hours); T5-3B and T5-Large are fine-tuned for 2 epochs on 8 × A100 GPUs (Wang et al., 2022).
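The objective above is an ordinary sequence negative log-likelihood; a toy numerical illustration in plain Python (no framework) of summing per-token terms:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence, given the model's
    probability assigned to each gold token y_t at step t."""
    return -sum(math.log(p) for p in token_probs)

# Suppose the model assigns these probabilities to the three gold
# output tokens (values are made up for illustration).
loss = sequence_nll([0.9, 0.8, 0.95])
print(round(loss, 4))  # → 0.3798
```

Minimizing this quantity over the training tasks is exactly the standard text-to-text fine-tuning recipe; no instruction-specific loss terms are added.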

5. Evaluation Protocols and Metrics

Sup-NatInst uses a split where test tasks are selected to ensure no data source overlap with training, mitigating the risk of data leakage. Testing covers 154 held-out tasks drawn from 12 categories. The primary aggregated metric is ROUGE-L (F-measure of longest common subsequence between generated and reference outputs):

F = \frac{(1 + \beta^2) \cdot R \cdot P}{R + \beta^2 P}

where R = |\text{LCS}| / |\text{reference}|, P = |\text{LCS}| / |\text{gen}|, and \beta = 1. For classification and certain analyses, Exact Match/Accuracy is also reported.
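ROUGE-L as defined here can be computed with a short LCS-based function (whitespace tokenization below is a simplification; the official scorer's tokenization and stemming differ):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists,
    via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ta == tb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(reference, generated, beta=1.0):
    """ROUGE-L F-measure with recall R = LCS/|ref|, precision P = LCS/|gen|."""
    ref, gen = reference.split(), generated.split()
    lcs = lcs_len(ref, gen)
    if lcs == 0:
        return 0.0
    r, p = lcs / len(ref), lcs / len(gen)
    return (1 + beta**2) * r * p / (r + beta**2 * p)

score = rouge_l("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # → 0.8333 (LCS of length 5 over 6 tokens each)
```

Because ROUGE-L degrades gracefully for partially correct outputs, it serves as a single aggregate metric across both classification-style and generation-style tasks.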

Human evaluation is conducted for English tasks via pairwise preference judgments (reference vs. model output; ties allowed), with models receiving credit if rated at least as good as the reference. Correlational analysis between human preference and ROUGE-L is provided (Pearson correlation: 0.998) (Wang et al., 2022).

6. Results and Scaling Laws

Tk-Instruct and mTk-Instruct consistently outperform heuristic and pretrained LM baselines, as well as previous instruction-tuned models such as InstructGPT, on both aggregate and per-category scores.

Typical ROUGE-L scores:

| Model Type | ROUGE-L Score |
|---------------------------------------|----------------------|
| Heuristic baselines | 14–28 |
| Pretrained LMs (T5-LM 11B, GPT-3) | 30–45 |
| T0 (11B) | 32 |
| InstructGPT (175B) | 52 |
| Tk-Instruct (11B) | 62 (+9.9 vs. InstructGPT) |
| mTk-Instruct (13B) | 66 (+13.3 vs. InstructGPT) |
| Supervised fine-tuning (upper bound) | ≈75 |
| Human oracle (sampled tasks) | >96% |

Tk-Instruct achieves higher accuracy than InstructGPT while being an order of magnitude smaller. Human raters prefer Tk-Instruct’s outputs to reference on 77% of instances. Empirical scaling laws are observed:

  • Performance scales linearly with \log_2(\#\text{tasks}): doubling the number of training tasks yields a 2–3 point ROUGE-L improvement.
  • Model size also contributes log-linearly: approximately 2 ROUGE-L points per parameter doubling.
  • Performance plateaus after ≈64 examples per task; further instances yield diminishing returns.
  • Approximate scaling law: \text{Score} \approx a + b \log_2(\#\text{tasks}) + c \log_2(\#\text{params}) (Wang et al., 2022).
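The log-linear scaling relationship can be illustrated numerically (the coefficients a, b, c below are placeholders chosen to roughly match the reported 2–3 points per task doubling and ~2 points per parameter doubling, not fitted values from the paper):

```python
import math

def predicted_rouge_l(n_tasks, n_params, a=20.0, b=2.5, c=2.0):
    """Log-linear scaling sketch: score ≈ a + b·log2(tasks) + c·log2(params).
    Coefficients are illustrative placeholders, not fitted values."""
    return a + b * math.log2(n_tasks) + c * math.log2(n_params)

base = predicted_rouge_l(256, 3e9)
doubled_tasks = predicted_rouge_l(512, 3e9)
print(round(doubled_tasks - base, 2))  # → 2.5 (one task doubling adds b points)
```

Under this functional form, task diversity and model size contribute additively, which matches the paper's observation that both axes improve generalization independently.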

7. Limitations and Future Research Directions

Several limitations are acknowledged:

  • Task distribution is still skewed towards English and short-form, extractive tasks.
  • ROUGE-L, while correlated with human judgment, is limited in evaluating open-ended generation and ignores task-specific nuances.
  • Computational requirements restrict exploration to ≤11B parameter models in practice.

Future directions include:

  • Extension to more languages and under-represented task types (long-form generation, structured outputs).
  • Development of unified, stronger metrics for open-ended generation.
  • Incorporation of additional modalities (e.g., vision, speech) and multimodal instructions.
  • Examination of instruction schema variants (increased demonstrations, compositional instructions) (Wang et al., 2022).

Sup-NatInst and Tk-Instruct offer a reproducible public platform and leaderboard that facilitate progress toward instruction-driven, generalist NLP systems. The results demonstrate that large-scale, diverse task-instruction pairs are effective for enabling model generalization to new task types—even with model sizes considerably smaller than proprietary baselines (Wang et al., 2022).

References

  • Wang, Y., Mishra, S., Alipoormolabashi, P., et al. (2022). Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. In Proceedings of EMNLP 2022.
