Super-Natural Instructions Benchmark
- The benchmark introduces a unified evaluation suite with 1,616 tasks, overcoming previous limitations in sparse and inconsistent instruction data.
- Sup-NatInst enables robust assessment of cross-task and cross-lingual generalization through both natural language and pseudo-code instruction formats.
- It provides a reproducible framework complete with standardized datasets, baselines, model checkpoints, and a public leaderboard to drive further advances.
The Super-NaturalInstructions (Sup-NatInst) benchmark is a large-scale evaluation suite designed for rigorous assessment of cross-task generalization in NLP models given natural language task instructions. Introduced by Wang et al. (2022), Sup-NatInst provides 1,616 diverse NLP tasks across 76 task types and 55 languages, each annotated with expert-written declarative instructions, positive/negative examples, and millions of input–output instances. Sup-NatInst addresses critical questions regarding the ability of pretrained LLMs to generalize to unseen tasks given only task instructions, and provides a public dataset, models, and a leaderboard to drive reproducible research on instruction-following models (Wang et al., 2022). Subsequent work demonstrates that systematically reformatting these instructions as pseudo-code further enhances LLM performance (Mishra et al., 2023).
1. Motivation and Objectives
Sup-NatInst was conceived in response to limitations of earlier instruction-tuning resources, such as small task pools (e.g., 61 tasks in NaturalInstructions v1.0) and inconsistent task definitions lacking positive/negative demonstrations. The principal objectives are:
- To ascertain which factors—model scale, instruction tuning, task diversity—most strongly influence cross-task generalization in LMs.
- To overcome constraints of prior benchmarks by offering ≈1,600 tasks with unified, expert-authored instructions across a comprehensive task taxonomy.
- To foster research on zero-shot and few-shot generalization, especially when evaluating models on tasks held out during training.
- To provide a reproducible framework including a standardized dataset, baselines, model checkpoints (notably Tk-Instruct), and a leaderboard (Wang et al., 2022).
2. Dataset Taxonomy and Construction
Sup-NatInst contains 1,616 tasks grouped into 76 distinct categories spanning classification, extraction, infilling, sequence tagging, text rewriting, text composition, and others. Major subcategories include:
- Classification: sentiment analysis, natural language inference, intent detection.
- Extraction: slot filling, named entity recognition.
- Infilling: masked language modeling, cloze tests.
- Text rewriting and composition: paraphrasing, summarization, title generation.
Each task consists of a declarative natural language instruction, positive and negative examples with explanations, and thousands of input–output pairs. The instructions follow a uniform schema:
- Definition: Natural language description of the mapping from inputs to expected outputs.
- Positive examples: ⟨input, correct output⟩ pairs with brief rationales.
- Negative examples: ⟨input, incorrect output⟩ pairs with brief rationales.
For example, the “RTE/Textual Entailment” task is defined as “Given two sentences, label ‘1’ if the first entails the second, else ‘0’,” with an instance such as:
- Input: Sentence 1: “No Weapons of Mass Destruction Found in Iraq Yet.” Sentence 2: “Weapons of Mass Destruction Found in Iraq.”
- Output: “0”
Task instructions and demonstrations are serialized as prependable fields, ensuring a unified interface for model input (Wang et al., 2022).
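As a concrete picture of this interface, the sketch below models one task with the schema fields above and flattens the definition and demonstrations into a prompt prefix. The field names, demonstration texts, and template wording are illustrative assumptions rather than the benchmark's exact serialization.

```python
# Illustrative sketch of a Sup-NatInst-style task record and its serialization
# into a prompt prefix. Field names, demonstration texts, and template wording
# are assumptions for illustration only.

task = {
    "Definition": "Given two sentences, label '1' if the first entails the second, else '0'.",
    "Positive Examples": [
        {"input": "Sentence 1: A man is sleeping. Sentence 2: A person is asleep.",
         "output": "1",
         "explanation": "The first sentence implies the second."},
    ],
    "Negative Examples": [
        {"input": "Sentence 1: A dog barks. Sentence 2: A cat meows.",
         "output": "1",
         "explanation": "The first sentence does not entail the second, so '1' is incorrect."},
    ],
    "Instances": [
        {"input": ("Sentence 1: No Weapons of Mass Destruction Found in Iraq Yet. "
                   "Sentence 2: Weapons of Mass Destruction Found in Iraq."),
         "output": ["0"]},
    ],
}

def serialize(task: dict, instance: dict) -> str:
    """Prepend the definition and demonstrations to a test instance."""
    parts = [f"Definition: {task['Definition']}"]
    for ex in task["Positive Examples"]:
        parts.append(f"Positive Example\nInput: {ex['input']}\nOutput: {ex['output']}")
    for ex in task["Negative Examples"]:
        parts.append(f"Negative Example\nInput: {ex['input']}\nOutput: {ex['output']}")
    parts.append(f"Now complete the following example.\nInput: {instance['input']}\nOutput:")
    return "\n\n".join(parts)

print(serialize(task, task["Instances"][0]))
```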
3. Cross-Task Generalization Protocols
Sup-NatInst is explicitly designed to support cross-task generalization: models are trained on instructions for a subset of tasks and evaluated on held-out, unseen tasks (both in English and cross-lingually). Data is partitioned as follows:
- Training pool: the remaining 1,462 tasks; after removing tasks that overlap with the evaluation set, 757 are used for English training and 1,271 for cross-lingual training.
- Evaluation set: 154 tasks (119 English, 35 cross-lingual), with up to 100 test instances per task.
All instances from evaluation tasks are held out during training. Zero-shot generalization corresponds to using only the task definition as instruction; k-shot corresponds to prepending a small set of positive/negative exemplars (Wang et al., 2022).
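A minimal sketch of this partitioning is given below, assuming tasks are keyed by name with per-task instance lists; the official split files define the actual task assignments, and the task names here are placeholders.

```python
# Sketch of the cross-task split: evaluation tasks are held out entirely and
# capped at a fixed number of test instances (100 in Sup-NatInst). The data
# layout (task name -> list of instances) and task names are illustrative
# assumptions.

import random

def split_tasks(all_tasks: dict, eval_task_names: set, max_eval_instances: int = 100):
    rng = random.Random(0)
    train_tasks, eval_tasks = {}, {}
    for name, instances in all_tasks.items():
        if name in eval_task_names:
            # The entire task is unseen during training; sample up to the cap for scoring.
            k = min(len(instances), max_eval_instances)
            eval_tasks[name] = rng.sample(instances, k)
        else:
            train_tasks[name] = instances
    return train_tasks, eval_tasks

train, evaluation = split_tasks(
    {"task001_rte": [{"input": "...", "output": ["0"]}],
     "task002_paraphrase": [{"input": "...", "output": ["..."]}]},
    eval_task_names={"task001_rte"},
)
```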
4. Evaluation Metrics and Scoring
Multiple aggregation and task-level metrics are employed:
- ROUGE-L: Used as the principal aggregate metric, measuring flexible string overlap between model output and reference.
- Classification metrics: Accuracy, Precision, Recall, F1-score (micro, macro, weighted).
- Correlation: ROUGE-L correlates strongly with classification accuracy across task categories, supporting its use as the single aggregate metric.
- For classification, QA, and generative tasks, additional metrics include BLEU, ROUGE-N, METEOR, and Exact Match (EM).
Aggregate performance is reported as the mean metric over evaluation tasks, and the relative gain when shifting from natural-language to pseudo-code prompts is computed as

$$\text{relative gain} = \frac{\text{score}_{\text{pseudo-code}} - \text{score}_{\text{NL}}}{\text{score}_{\text{NL}}} \times 100\%$$

(Mishra et al., 2023).
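In code, the aggregate score and relative gain might be computed as in the following sketch, which assumes the third-party `rouge-score` Python package and uses placeholder prediction/reference lists; the benchmark's official evaluation scripts should be used for reported results.

```python
# Sketch: mean ROUGE-L over (prediction, reference) pairs plus the relative
# gain between two prompt formats. Assumes the `rouge-score` package
# (pip install rouge-score); the inputs are placeholders.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def mean_rouge_l(predictions, references) -> float:
    scores = [scorer.score(ref, pred)["rougeL"].fmeasure
              for pred, ref in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)

def relative_gain(pseudo_code_score: float, nl_score: float) -> float:
    """Relative improvement (%) when moving from NL to pseudo-code prompts."""
    return 100.0 * (pseudo_code_score - nl_score) / nl_score

nl = mean_rouge_l(["the film is positive"], ["positive"])   # placeholder data
pc = mean_rouge_l(["positive"], ["positive"])               # placeholder data
print(f"NL: {nl:.1f}  pseudo-code: {pc:.1f}  relative gain: {relative_gain(pc, nl):.1f}%")
```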
5. Instructional Formulation: Natural Language and Pseudo-Code
Sup-NatInst instructions are systematically structured, enabling research into instructional modality:
Natural Language Instructions
- Plain, expert-written descriptions.
- Accompanied by structured I/O examples.
- Used as zero-shot model cues and as demonstrations in k-shot settings.
Pseudo-Code Instructions
Mishra et al. (2023) extend the benchmark by representing 132 Sup-NatInst tasks as pseudo-code prompts using a four-part schema:
- Function prototype: Typed (PEP 484 style) Python signature.
- Docstring: NL paraphrase, with “Parameters” and “Returns” sections.
- Pseudo-code body: Control-flow, helper predicates, inline comments elucidating reasoning steps.
- Interpreter invocation: “>>>” prompt with real test input.
Concrete zero-shot example for sentiment analysis:
```python
def generate_sentiment(sentence: str) -> str:
    """For the given sentence, predict the sentiment.

    Parameters:
    sentence (str): input sentence

    Returns:
    str: "positive" or "negative"
    """
    # check if sentiment is positive
    if sentiment_is_positive(sentence):
        return "positive"
    else:
        return "negative"

>>> generate_sentiment("that has a charmingly bourbon air.")
```
Each element (signature, docstring, code structure, comments) is shown to contribute cumulatively to interpretability and model performance (Mishra et al., 2023).
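To make the prompting pipeline concrete, here is a hedged sketch of feeding such a pseudo-code prompt to a causal code LM with Hugging Face `transformers`; the specific checkpoint name and generation settings are assumptions for illustration, not the exact setup of Mishra et al. (2023).

```python
# Sketch: completing a pseudo-code prompt with a causal code LM. The checkpoint
# and decoding settings are illustrative assumptions; Mishra et al. (2023)
# evaluate CodeGen and BLOOM variants with their own configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"   # assumed small CodeGen checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = '''def generate_sentiment(sentence: str) -> str:
    """For the given sentence, predict the sentiment.

    Parameters:
    sentence (str): input sentence

    Returns:
    str: "positive" or "negative"
    """
    # check if sentiment is positive
    if sentiment_is_positive(sentence):
        return "positive"
    else:
        return "negative"

>>> generate_sentiment("that has a charmingly bourbon air.")
'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion.strip())   # expected to contain the predicted label, e.g. "positive"
```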
6. Baseline Models, Results, and Scaling Analyses
Sup-NatInst anchors comparative baselines:
Heuristic Baselines:
- Copy Input: 14.2 ROUGE-L (English) / 5.4 (X-lingual)
- Copy Demo Output: 28.5 ROUGE-L (English) / 50.3 (X-lingual); both heuristics are sketched in code after the results table below
Zero-Shot & Instruction-Tuned LMs:
- T5-LM (11B): 30.2 ROUGE-L (English)
- GPT-3 (175B): 45.0–51.3 ROUGE-L
- T0 (11B): 32.3 ROUGE-L
- InstructGPT (175B): 52.1–52.8 ROUGE-L
- Tk-Instruct (11B, fine-tuned on Sup-NatInst): 62.0 ROUGE-L (English), +9.9 above InstructGPT despite 16× fewer parameters
- mTk-Instruct (13B, multilingual): up to 66.1 ROUGE-L (X-lingual), +13.3 over InstructGPT
| Method | English ROUGE-L | X-lingual ROUGE-L |
|---|---|---|
| Copy Input | 14.2 | 5.4 |
| Copy Demo Output | 28.5 | 50.3 |
| T5-LM (11B) | 30.2 | – |
| GPT-3 (175B) | 45.0 | 51.3 |
| T0 (11B) | 32.3 | – |
| InstructGPT (175B) | 52.1 | 52.8 |
| Tk-Instruct (11B) | 62.0 | – |
| mTk-Instruct (13B) | 57.1 | 66.1 |
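For concreteness, the two heuristic baselines amount to trivial functions; the following minimal sketch (with assumed field names) illustrates what they do and is not the benchmark's evaluation code.

```python
# Minimal sketch of the two heuristic baselines. Both ignore the task
# definition entirely; field names are illustrative assumptions.

def copy_input(instance: dict) -> str:
    """Return the test input verbatim as the prediction."""
    return instance["input"]

def copy_demo_output(task: dict) -> str:
    """Return the first positive demonstration's output, regardless of the test input."""
    return task["Positive Examples"][0]["output"]
```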
Scaling analyses reveal:
- Log-linear ROUGE-L gains with more distinct tasks and increasing model scale.
- Performance saturates after ≈64 instances per task; increasing I/O count beyond this yields diminishing returns.
- Diversity of tasks and model scale are critical for generalization, whereas additional examples per task have lower marginal effect (Wang et al., 2022).
7. Prompting Strategies and Empirical Insights
Recent experiments using Sup-NatInst with pseudo-code prompts show:
- Pseudo-code prompts yield absolute F1 gains of 7–12 points on classification, and 7–19% relative gains in ROUGE-L.
- Code-specialized LMs (CodeGen) excel even on natural language prompts, but pseudo-code yields peak performance.
- For extractive/generative QA, pseudo-code consistently improves results for CodeGen; BLOOM also benefits, especially on MCQ QA.
- Ablations confirm the additive importance of function signature, docstring, comments, and control flow; a sketch of these ablation variants follows this list.
- Removing docstrings/comments: −0.027 F1, −0.019 ROUGE-L (CodeGen-6B).
- Adding docstrings/comments to NL prompt: +0.039 F1 (CodeGen-6B).
- Using only function signature or few-shot examples underperforms full pseudo-code prompt (Mishra et al., 2023).
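The ablations can be pictured as programmatic edits to the prompt components. The sketch below assembles variants from the four schema parts; the variant names and assembly details are assumptions for illustration, not the exact conditions of Mishra et al. (2023).

```python
# Sketch: assembling prompt-ablation variants from the four pseudo-code
# components (signature, docstring, body with comments, interpreter call).
# Variant names and composition are illustrative assumptions.

import re

def build_variants(signature: str, docstring: str, body: str, invocation: str) -> dict:
    strip_comments = lambda code: re.sub(r"\s*#.*", "", code)  # drop inline comments
    return {
        "full_pseudocode": f"{signature}\n{docstring}\n{body}\n\n{invocation}",
        "no_docstring_or_comments": f"{signature}\n{strip_comments(body)}\n\n{invocation}",
        "signature_only": f"{signature}\n\n{invocation}",
    }

variants = build_variants(
    signature='def generate_sentiment(sentence: str) -> str:',
    docstring='    """For the given sentence, predict the sentiment."""',
    body='    # check if sentiment is positive\n'
         '    if sentiment_is_positive(sentence):\n'
         '        return "positive"\n'
         '    return "negative"',
    invocation='>>> generate_sentiment("that has a charmingly bourbon air.")',
)
for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```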
8. Key Findings and Implications
- Instruction fine-tuning on Sup-NatInst leads to substantial cross-task and cross-lingual generalization, as evidenced by Tk-Instruct surpassing much larger models.
- Human evaluations align closely with ROUGE-L rankings, underlining the metric’s fidelity for measuring instruction-driven task completion.
- Model robustness extends to variations in input encoding and instruction format.
- Notable gaps persist between even the best instruction-tuned models and fully supervised upper bounds, leaving room for further advances in instruction learning.
- Pseudo-code prompts further disambiguate task intent and boost model performance, with every prompt element playing a synergistic role.
- Sup-NatInst, together with Tk-Instruct and empirical prompt-style analyses, establishes the leading open benchmark for studying and advancing general-purpose instruction-based NLP modeling (Wang et al., 2022, Mishra et al., 2023).