
WebInstruct: Large-Scale Instruction Tuning Dataset

Updated 16 March 2026
  • WebInstruct is a comprehensive dataset constructed from web sources using robust multi-stage pipelines for instruction-tuning.
  • It employs document retrieval, Q&A extraction, and answer refinement processes to optimize supervised and critique fine-tuning of LLMs.
  • Its multimodal variant, VisualWebInstruct, extends the paradigm to visual reasoning, enhancing performance on STEM and domain-specific tasks.

WebInstruct is a family of large-scale, instruction-tuning datasets constructed from web sources with the aim of enabling more effective and data-efficient supervised fine-tuning (SFT) and critique fine-tuning (CFT) of LLMs. WebInstruct datasets are characterized by their breadth across STEM and non-STEM domains, modular construction pipelines involving web-scale retrieval and LLM-based refinement, and their role as a cornerstone resource for instruction following, reasoning, and critique-aware training paradigms.

1. Dataset Construction Pipelines

WebInstruct datasets are defined by robust, multi-stage data acquisition and synthesis protocols that distinguish them from traditional crowd-sourced or LLM-distilled instruction sets.

1.1 WebInstruct (MAmmoTH2)

The primary 10-million-example WebInstruct corpus underpinning MAmmoTH2 (Yue et al., 2024) is mined from Common Crawl via a three-stage process:

  • Document Recall: fastText classifiers are trained on seed data (100K positive Q&A pages and 100K negatives), then applied to pre-training-scale web corpora to identify educational Q&A domains. Domains are further curated via GPT-4 domain assessment.
  • Q&A Extraction: Qwen-72B is prompted with cleaned HTML for natural question–answer pair extraction. Boilerplate and known benchmark contamination are filtered.
  • Answer Refinement: Mixtral-8×22B and Qwen-72B reformulate responses, correct formality, and add explicit chain-of-thought steps where the extracted answer lacks them.
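
The benchmark-contamination filter in the extraction stage can be illustrated with a word n-gram overlap check. This is a minimal sketch under stated assumptions: the n-gram criterion, the window size `n=8`, and the `question` field name are illustrative choices, not details confirmed by the source.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`.

    Texts shorter than n words yield an empty set.
    """
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_questions, n=8):
    """Collect every n-gram that appears in any benchmark question."""
    index = set()
    for question in benchmark_questions:
        index |= ngrams(question, n)
    return index


def decontaminate(pairs, benchmark_questions, n=8):
    """Keep only Q&A pairs whose question shares no n-gram with a benchmark.

    `pairs` is a list of dicts with a hypothetical "question" key.
    """
    index = build_benchmark_index(benchmark_questions, n)
    return [p for p in pairs if ngrams(p["question"], n).isdisjoint(index)]
```

A smaller `n` removes more aggressively (more false positives), while a larger `n` only catches near-verbatim copies.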

1.2 WebInstruct Critique Subsets (CFT)

The WebInstruct-CFT variants (Wang et al., 29 Jan 2025) are constructed specifically for critique fine-tuning:

  • Base Corpus: Q&A data scraped from web forums and QA platforms, initially refined by Mixtral and Qwen-72B, with a high noise rate (estimated >50%).
  • Critique Generation: GPT-4o-1120 is used to verify responses (WebInstruct-verified), provide high-quality answers (WebInstruct-GPT-4o), and generate detailed critiques for noisy responses (WebInstruct-CFT), creating triplets of (query, noisy response, critique).
  • Subset Stratification: Subsets of 50K examples (plus a 4K “Tiny” variant) are selected for verified, GPT-4o, and critique-annotated data.

1.3 Web Reconstruction (WebR)

The WebR pipeline (Jiang et al., 22 Apr 2025) synthesizes instruction pairs using a dual-perspective paradigm—“Web as Instruction” and “Web as Response”—from arbitrary web documents, employing LLMs (e.g., GPT-4o-mini, Llama3-70B-Instruct) for both instruction and response construction via rewrite and latent-inference prompts. MinHash deduplication ensures diversity.
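
MinHash deduplication can be sketched in a few lines of standard-library Python. The shingle size, number of hash functions, similarity threshold, and the greedy quadratic comparison below are all illustrative assumptions rather than WebR's actual settings; production pipelines typically add locality-sensitive hashing to avoid pairwise comparison.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of `text` (whole text if shorter)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, record the minimum hash over all shingles."""
    items = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).hexdigest(), 16)
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Greedily keep a document only if it is not too similar to any kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```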

1.4 VisualWebInstruct

VisualWebInstruct (Jia et al., 13 Mar 2025) extends the paradigm to multimodal instruction data:

  • Seed Image Selection: 30K STEM images from open repositories serve as queries for Google Image Search, yielding ≈1.7M raw URLs.
  • HTML and Accessibility-Tree Processing: Extraction pipelines preserve text–image structure and prune non-educational content.
  • QA Extraction: Gemini 1.5-Flash and GPT-4o are used for QA pair extraction, answer refinement, and alignment.
  • Post-Processing: Consistency filtering and high-confidence answer alignment are applied, resulting in over 900K multimodal QA pairs.
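
The source does not spell out the consistency criterion. One plausible reading is majority agreement among several independently generated answers to the same question, which the sketch below illustrates. The `normalize` rule and the `min_agreement` threshold are assumptions made for this example only.

```python
from collections import Counter


def normalize(answer):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(answer.strip().lower().split())


def consistent_answer(candidates, min_agreement=0.5):
    """Return the majority answer if its vote share exceeds `min_agreement`.

    Returns None when there are no candidates or no answer is dominant,
    signalling that the QA pair should be dropped.
    """
    if not candidates:
        return None
    counts = Counter(normalize(a) for a in candidates)
    answer, votes = counts.most_common(1)[0]
    if votes / len(candidates) > min_agreement:
        return answer
    return None
```

For example, `consistent_answer(["42", " 42 ", "41"])` keeps the pair with answer `"42"`, whereas three mutually different candidates yield `None`.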

2. Dataset Structure and Annotation Schema

WebInstruct and its derivatives employ structured data schemas for instruction-tuning and critique training.

2.1 Instruction–Response Format

The core SFT data is a set of pairs $(x, y) = (\text{instruction}, \text{response})$, where the instruction spans mathematics, science, engineering, humanities, or general knowledge tasks, and the response is either web-mined or LLM-refined for clarity, formality, and explicit reasoning steps.

2.2 Critique Fine-Tuning Triplets

The critique dataset adheres to a triple format $(x, y, c)$, where $x$ is the original instruction, $y$ the noisy/unverified answer, and $c$ the step-by-step critique generated by GPT-4o. Critiques typically identify correctness, reason about errors or omissions, and conclude with a correctness label.

2.3 Representative Example

$\begin{aligned} &\text{Input: } [x; y] = [\text{"Compute } \sum_{k=1}^{10} k^2\text{"};\ \text{"The sum is 385."}] \\ &\text{Output (critique) } c: \\ &\text{"Your answer is correct, but you should show the derivation. For instance, use} \\ &\quad \sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}\text{, which for } n = 10 \text{ gives 385."} \end{aligned}$

(Wang et al., 29 Jan 2025)
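
The triplet format and the example above can be made concrete with a small record type. This is a sketch only: the class name, field names, and prompt template are hypothetical and are not taken from the WebInstruct-CFT release. The closed-form helper checks the arithmetic quoted in the critique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CritiqueTriplet:
    """One CFT example: (x, y, c) = (query, noisy response, critique)."""
    query: str
    noisy_response: str
    critique: str


def to_training_example(triplet):
    """Concatenate [x; y] into the model input; the critique c is the target.

    The template text is an assumption made for illustration.
    """
    prompt = (
        f"Question:\n{triplet.query}\n\n"
        f"Candidate answer:\n{triplet.noisy_response}\n\n"
        "Critique:\n"
    )
    return prompt, triplet.critique


def sum_of_squares(n):
    """Closed form n(n+1)(2n+1)/6 for 1^2 + 2^2 + ... + n^2."""
    return n * (n + 1) * (2 * n + 1) // 6


example = CritiqueTriplet(
    query="Compute the sum of k^2 for k = 1..10",
    noisy_response="The sum is 385.",
    critique="Your answer is correct, but you should show the derivation.",
)
prompt, target = to_training_example(example)
```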

2.4 Metadata

Subsets track:

  • response correctness (e.g., 56% correct, 44% wrong in CFT)
  • source (WebInstruct, GPT-4o, verified, CFT, CFT-Tiny)
  • token lengths (mean query ≈ 30, response ≈ 80, critique ≈ 120)
  • image associations for VisualWebInstruct

3. Corpus Statistics and Domain Coverage

Distinct WebInstruct variants exhibit large scale and broad topicality.

| Subset | Size (Pairs) | Modalities | Domain Emphasis |
|---|---|---|---|
| WebInstruct (MAmmoTH2) | 10,000,000 | text | Math, Science, Engineering, Humanities |
| WebInstruct-CFT | 50,000 | text, with critiques | ~65% Math, 8% Physics, 4% Chemistry, 10% Business, 4% Humanities, rest diverse |
| VisualWebInstruct | 906,160 | text, image | 62.5% Math, 14.5% Physics, 7.25% Finance, 4.8% Chemistry, 4.35% Engineering |
| WebR-Basic | 100,000 | text | 70% General, 15% Math, 15% Code |

VisualWebInstruct includes 347,313 image-associated QA pairs and covers 163,743 unique images (Jia et al., 13 Mar 2025).
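
Taking the reported counts at face value, the share of image-associated pairs and the average number of such pairs per unique image work out as follows (derived figures, rounded; not reported in the source):

$\begin{aligned} \frac{347{,}313}{906{,}160} &\approx 0.383 \quad (\text{about } 38.3\% \text{ of all QA pairs are image-associated}) \\ \frac{347{,}313}{163{,}743} &\approx 2.12 \quad (\text{image-associated QA pairs per unique image}) \end{aligned}$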

4. Training Objectives, Fine-Tuning Protocols, and Impact

4.1 SFT and CFT Training Losses

  • SFT Loss (Instruction–Response):

$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{(x, y) \in D} \sum_{t=1}^{|y|} \log P_\theta(y_t \mid y_{<t}, x)$

  • CFT Loss (Critique Generation):

$\mathcal{L}(\theta) = -\sum_{(x, y, c) \in \mathrm{WebInstruct\text{-}CFT}} \log P_\theta(c \mid [x ; y])$

(Yue et al., 2024, Wang et al., 29 Jan 2025)
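
The two objectives differ only in what is conditioned on and what is predicted: SFT predicts the response $y$ given $x$, while CFT predicts the critique $c$ given the concatenation $[x; y]$. The sketch below makes this explicit with a hypothetical `logprob_fn(prefix, token)` callback standing in for $\log P_\theta$; the callback, the token-list representation, and the toy uniform model are assumptions for illustration, not part of any released training code.

```python
import math


def target_nll(logprob_fn, context, target):
    """Negative log-likelihood of `target` tokens given `context` tokens.

    Uses the autoregressive factorisation: each target token is scored
    given the context plus all previously emitted target tokens.
    """
    prefix = list(context)
    total = 0.0
    for token in target:
        total -= logprob_fn(prefix, token)
        prefix.append(token)
    return total


def sft_loss(logprob_fn, dataset):
    """Sum over (x, y) pairs of -log P(y | x)."""
    return sum(target_nll(logprob_fn, x, y) for x, y in dataset)


def cft_loss(logprob_fn, dataset):
    """Sum over (x, y, c) triplets of -log P(c | [x; y])."""
    return sum(
        target_nll(logprob_fn, list(x) + list(y), c) for x, y, c in dataset
    )


def uniform_logprob(vocab_size):
    """Toy stand-in model assigning probability 1/vocab_size to every token."""
    return lambda prefix, token: -math.log(vocab_size)
```

Under the toy uniform model each loss reduces to the number of target tokens times $\log V$, which gives a quick sanity check that only $y$ (for SFT) or only $c$ (for CFT) contributes to the loss.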

4.2 Empirical Impact

  • Training Qwen2.5-Math-7B with CFT (50K CFT samples, 1 epoch) yields a 4–10 point accuracy increase over best SFT baselines on six reasoning benchmarks (MATH, Minerva-Math, GSM8K, OlympiadBench, AIME24, AMC23).
  • WebInstruct enables sub-10B LLMs to achieve state-of-the-art or near-SOTA results on benchmarks such as GSM8K and MATH: relative to its base model, MAmmoTH2-7B improves from 11.2% to 34.2% on MATH and from 36.2% to 67.4% on GSM8K in zero-shot evaluation (Yue et al., 2024, Wang et al., 29 Jan 2025).
  • Ablation studies show CFT’s data efficiency and robustness to critique source/model (Wang et al., 29 Jan 2025).
  • VisualWebInstruct fine-tuning leads MAmmoTH-VL2 to SOTA among 7–10B VLMs on MMMU-Pro (40.7), MathVerse (42.6), DynaMath (55.7) (Jia et al., 13 Mar 2025).

5. Relation to Other Instruction-Tuning Approaches

WebInstruct establishes distinct advantages:

  1. Scale: At 10 million pairs, WebInstruct (MAmmoTH2) exceeds prior SFT sets, at roughly 5.6× OpenMathInstruct (1.8M) and 7.1× XwinMath (1.4M).
  2. Naturalness: All base pairs originate from public, human-authored Q&A or educational content.
  3. Critique Supervision: Provides critique-annotated triplets (unique among SFT/IT datasets).
  4. Modality Diversity: VisualWebInstruct expands SFT coverage to multimodal visual–text input.
  5. Pipeline Transparency: Documented multi-stage extraction/refinement, domain filtering, and decontamination protocols (Yue et al., 2024).

6. Applications and Limitations

6.1 Major Use Cases

  • Training LLMs for high-fidelity instruction following across STEM, business, and humanities.
  • Critique fine-tuning to improve error identification and correction (WebInstruct-CFT), facilitating LMs that “think about” as well as “produce” answers (Wang et al., 29 Jan 2025).
  • Multimodal instruction-tuning for visual reasoning and solution explanation (VisualWebInstruct) (Jia et al., 13 Mar 2025).
  • Data-efficient domain adaptation and specialization for mathematical, scientific, and technical sub-domains (Jiang et al., 22 Apr 2025).

6.2 Known Limitations

  • Model-generated critiques contain minor errors (≈20% in CFT).
  • Critiques are not human-verified; all annotation is synthesized by GPT-4o.
  • Non-STEM domains (e.g., humanities, social sciences) are under-represented; downstream performance is untested in these areas.
  • For CFT variants, inference does not involve self-critique; CFT is used only for training supervision (Wang et al., 29 Jan 2025).
  • VisualWebInstruct leverages black-box Google Lens image similarity, with no open scoring function (Jia et al., 13 Mar 2025).

7. Code, Data Availability, and Reproducibility

WebInstruct, across its variants, represents a new standard for large-scale, natural, and critique-supplemented instruction-tuning data, enabling LLM research to decouple reasoning improvement from scale, human annotation, or heavy reliance on LLM-generated synthetic seed sets.
