WebInstruct: Large-Scale Instruction Tuning Dataset
- WebInstruct is a comprehensive dataset constructed from web sources using robust multi-stage pipelines for instruction-tuning.
- It employs document retrieval, Q&A extraction, and answer refinement processes to optimize supervised and critique fine-tuning of LLMs.
- Its multimodal variant, VisualWebInstruct, extends the paradigm to visual reasoning, enhancing performance on STEM and domain-specific tasks.
WebInstruct is a family of large-scale, instruction-tuning datasets constructed from web sources with the aim of enabling more effective and data-efficient supervised fine-tuning (SFT) and critique fine-tuning (CFT) of LLMs. WebInstruct datasets are characterized by their breadth across STEM and non-STEM domains, modular construction pipelines involving web-scale retrieval and LLM-based refinement, and their role as a cornerstone resource for instruction following, reasoning, and critique-aware training paradigms.
1. Dataset Construction Pipelines
WebInstruct datasets are defined by robust, multi-stage data acquisition and synthesis protocols that distinguish them from traditional crowd-sourced or LLM-distilled instruction sets.
1.1 WebInstruct (MAmmoTH2)
The primary 10-million-example WebInstruct corpus underpinning MAmmoTH2 (Yue et al., 2024) is mined from Common Crawl via a three-stage process:
- Document Recall: fastText classifiers are trained on seed data (100K positive Q&A pages and 100K negatives), then applied to pre-training-scale web corpora to identify educational Q&A domains. Domains are further curated via GPT-4 domain assessment.
- Q&A Extraction: Qwen-72B is prompted with cleaned HTML for natural question–answer pair extraction. Boilerplate and known benchmark contamination are filtered.
- Answer Refinement: Mixtral-8×22B and Qwen-72B reformulate responses, correct formality, and add explicit chain-of-thought reasoning where it is missing.
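The three stages above can be sketched as a simple filter-extract-refine loop. This is a minimal illustration, not the MAmmoTH2 implementation: the callables `is_educational`, `extract_qa`, and `refine` are hypothetical stand-ins for the fastText recall model and the LLM prompts, and the word n-gram overlap rule used for decontamination (with `n = 8`) is an assumption, since the text above does not specify how benchmark contamination is detected.

```python
from typing import Callable, Iterable, List, Set, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased word n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(question: str, benchmark_questions: Iterable[str], n: int = 8) -> bool:
    """Assumed rule: flag a question that shares any word n-gram with a benchmark item."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(b, n) for b in benchmark_questions)


def build_corpus(
    pages: Iterable[str],
    is_educational: Callable[[str], bool],      # stage 1: stand-in for the recall classifier
    extract_qa: Callable[[str], List[QAPair]],  # stage 2: stand-in for the LLM extractor
    refine: Callable[[QAPair], QAPair],         # stage 3: stand-in for the LLM refiner
    benchmark_questions: List[str],
) -> List[QAPair]:
    """Run recall, extraction with decontamination, and refinement over raw pages."""
    corpus: List[QAPair] = []
    for page in pages:
        if not is_educational(page):
            continue  # page rejected at the recall stage
        for question, answer in extract_qa(page):
            if is_contaminated(question, benchmark_questions):
                continue  # drop pairs overlapping a known benchmark
            corpus.append(refine((question, answer)))
    return corpus
```

Passing the stages in as callables keeps the sketch independent of any particular classifier or LLM client; in a real pipeline each stage would be batched and run separately over the crawl.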
1.2 WebInstruct Critique Subsets (CFT)
The WebInstruct-CFT variants (Wang et al., 29 Jan 2025) are constructed specifically for critique fine-tuning:
- Base Corpus: Q&A data scraped from web forums and QA platforms, initially refined by Mixtral and Qwen-72B, with a high noise rate (estimated >50%).
- Critique Generation: GPT-4o-1120 is used to verify responses (WebInstruct-verified), provide high-quality answers (WebInstruct-GPT-4o), and generate detailed critiques for noisy responses (WebInstruct-CFT), creating triplets of (query, noisy response, critique).
- Subset Stratification: Subsets of 50K examples (plus a 4K “Tiny” variant) are selected for verified, GPT-4o, and critique-annotated data.
1.3 Web Reconstruction (WebR)
The WebR pipeline (Jiang et al., 22 Apr 2025) synthesizes instruction pairs using a dual-perspective paradigm—“Web as Instruction” and “Web as Response”—from arbitrary web documents, employing LLMs (e.g., GPT-4o-mini, Llama3-70B-Instruct) for both instruction and response construction via rewrite and latent-inference prompts. MinHash deduplication ensures diversity.
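MinHash deduplication, mentioned above, can be illustrated with a small standard-library sketch. It is not the WebR code: the shingle size, the number of hash functions, the 0.8 similarity threshold, and the use of salted `blake2b` as the hash family are all assumptions chosen for illustration.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Word k-shingles of a lowercased text (the whole text if it is shorter than k words)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> List[int]:
    """For each salted hash function, keep the minimum hash value over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if it is not a near-duplicate of any text already kept."""
    kept: List[str] = []
    kept_sigs: List[List[int]] = []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept
```

The pairwise comparison here is quadratic in the number of kept texts; at web scale, signatures are usually bucketed with locality-sensitive hashing so that only likely duplicates are compared.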
1.4 VisualWebInstruct
VisualWebInstruct (Jia et al., 13 Mar 2025) extends the paradigm to multimodal instruction data:
- Seed Image Selection: 30K STEM images from open repositories serve as queries for Google Image Search, yielding ≈1.7M raw URLs.
- HTML and Accessibility-Tree Processing: Extraction pipelines preserve text–image structure and prune non-educational content.
- QA Extraction: Gemini 1.5-Flash and GPT-4o are used for QA pair extraction, answer refinement, and alignment.
- Post-Processing: Consistency filtering and high-confidence answer alignment are applied, resulting in over 900K multimodal QA pairs.
2. Dataset Structure and Annotation Schema
WebInstruct and its derivatives employ structured data schemas for instruction-tuning and critique training.
2.1 Instruction–Response Format
The core SFT data is a set of pairs $(x, y)$, where the instruction $x$ spans mathematics, science, engineering, humanities, or general knowledge tasks, and the response $y$ is either web-mined or LLM-refined for clarity, formality, and explicit reasoning steps.
2.2 Critique Fine-Tuning Triplets
The critique dataset adheres to a triple format $(x, y, c)$, where $x$ is the original instruction, $y$ the noisy/unverified answer, and $c$ the step-by-step critique generated by GPT-4o. Critiques typically assess correctness, reason about errors or omissions, and conclude with a correctness label.
2.3 Representative Example
$\begin{aligned} &\text{Input: } [x; y] = [\text{"Compute } \textstyle\sum_{k=1}^{10} k^2\text{"};\ \text{"The sum is 385."}] \\ &\text{Output (critique) } c\text{: "Your answer is correct, but you should show the derivation. For instance, use} \\ &\quad \sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}\text{, which for } n = 10 \text{ gives 385."} \end{aligned}$
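A critique triplet such as the one above can be represented and turned into a training pair as follows. This is a minimal sketch: the class name `CritiqueExample`, the function `to_training_pair`, and the prompt template are hypothetical and are not taken from the WebInstruct-CFT release. The final line checks the arithmetic in the example against the closed-form formula.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CritiqueExample:
    query: str           # x: the original instruction
    noisy_response: str  # y: the unverified answer
    critique: str        # c: the step-by-step critique


def to_training_pair(ex: CritiqueExample) -> Dict[str, str]:
    """Concatenate [x; y] into the model input; the critique c is the target."""
    prompt = (
        f"Question:\n{ex.query}\n\n"
        f"Candidate answer:\n{ex.noisy_response}\n\n"
        "Critique:\n"
    )
    return {"prompt": prompt, "target": ex.critique}


example = CritiqueExample(
    query="Compute the sum of k^2 for k = 1 to 10.",
    noisy_response="The sum is 385.",
    critique=(
        "Your answer is correct, but you should show the derivation. "
        "For instance, use n(n+1)(2n+1)/6, which for n = 10 gives 385."
    ),
)
pair = to_training_pair(example)

# Sanity check of the example itself: direct sum versus the closed form.
assert sum(k * k for k in range(1, 11)) == 10 * 11 * 21 // 6 == 385
```

Keeping the noisy response in the prompt rather than the target is what distinguishes this layout from ordinary SFT data, where the response itself is the target.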
2.4 Metadata
Subsets track:
- response correctness (e.g., 56% correct, 44% wrong in CFT)
- source (WebInstruct, GPT-4o, verified, CFT, CFT-Tiny)
- token lengths (mean query ≈ 30, response ≈ 80, critique ≈ 120)
- image associations for VisualWebInstruct
3. Corpus Statistics and Domain Coverage
Distinct WebInstruct variants exhibit large scale and broad topicality.
| Subset | Size (Pairs) | Modalities | Domain Emphasis |
|---|---|---|---|
| WebInstruct (MAmmoTH2) | 10,000,000 | text | Math, Science, Engineering, Humanities |
| WebInstruct-CFT | 50,000 | text, with critiques | ~65% Math, 8% Physics, 4% Chemistry, 10% Business, 4% Humanities, rest diverse |
| VisualWebInstruct | 906,160 | text, image | 62.5% Math, 14.5% Physics, 7.25% Finance, 4.8% Chemistry, 4.35% Engineering |
| WebR-Basic | 100,000 | text | 70% General, 15% Math, 15% Code |
VisualWebInstruct includes 347,313 image-associated QA pairs and covers 163,743 unique images (Jia et al., 13 Mar 2025).
4. Training Objectives, Fine-Tuning Protocols, and Impact
4.1 SFT and CFT Training Losses
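- SFT Loss (Instruction–Response): $\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y)}\left[\log P_\theta(y \mid x)\right]$
- CFT Loss (Critique Generation): $\mathcal{L}_{\text{CFT}}(\theta) = -\mathbb{E}_{(x, y, c)}\left[\log P_\theta(c \mid [x; y])\right]$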
(Yue et al., 2024, Wang et al., 29 Jan 2025)
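In practice, the two objectives differ only in which tokens are conditioned on and which are scored: SFT scores the response tokens given the instruction, while CFT scores the critique tokens given the instruction and the noisy response. The sketch below makes this concrete with a loss mask over per-token log-probabilities. The function names are hypothetical, and averaging over target tokens (rather than summing over the sequence) is an assumed normalization, not something the cited papers are stated here to use.

```python
import math
from typing import List, Sequence


def masked_nll(token_logprobs: Sequence[float], target_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the positions marked as targets."""
    if len(token_logprobs) != len(target_mask):
        raise ValueError("log-probabilities and mask must have the same length")
    scored = [lp for lp, is_target in zip(token_logprobs, target_mask) if is_target]
    if not scored:
        raise ValueError("mask selects no target tokens")
    return -sum(scored) / len(scored)


def sft_mask(n_instruction: int, n_response: int) -> List[bool]:
    """SFT: condition on the instruction x, score the response y."""
    return [False] * n_instruction + [True] * n_response


def cft_mask(n_instruction: int, n_response: int, n_critique: int) -> List[bool]:
    """CFT: condition on [x; y], score only the critique c."""
    return [False] * (n_instruction + n_response) + [True] * n_critique


# Toy sequence: two conditioning tokens followed by two target tokens.
logprobs = [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.25)]
sft_loss = masked_nll(logprobs, sft_mask(2, 2))
cft_loss = masked_nll(logprobs, cft_mask(1, 1, 2))
```

Both calls mask the first two positions, so the toy losses coincide; the difference in a real run comes from what those conditioning positions contain and what the model is asked to generate.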
4.2 Empirical Impact
- Training Qwen2.5-Math-7B with CFT (50K CFT samples, 1 epoch) yields a 4–10 point accuracy increase over best SFT baselines on six reasoning benchmarks (MATH, Minerva-Math, GSM8K, OlympiadBench, AIME24, AMC23).
- WebInstruct enables sub-10B LLMs to achieve state-of-the-art or near-SOTA results on benchmarks such as GSM8K and MATH, with MAmmoTH2-7B’s accuracy increasing from 11.2% to 34.2% (MATH) and from 36.2% to 67.4% (GSM8K) in zero-shot scenarios (Yue et al., 2024, Wang et al., 29 Jan 2025).
- Ablation studies show CFT’s data efficiency and robustness to critique source/model (Wang et al., 29 Jan 2025).
- VisualWebInstruct fine-tuning leads MAmmoTH-VL2 to SOTA among 7–10B VLMs on MMMU-Pro (40.7), MathVerse (42.6), DynaMath (55.7) (Jia et al., 13 Mar 2025).
5. Relation to Other Instruction-Tuning Approaches
WebInstruct establishes distinct advantages:
- Scale: At 10 million pairs, WebInstruct (MAmmoTH2) exceeds prior SFT sets, at roughly 5.6× the size of OpenMathInstruct (1.8M) and about 7× that of XwinMath (1.4M).
- Naturalness: All base pairs originate from public, human-authored Q&A or educational content.
- Critique Supervision: Provides critique-annotated triplets (unique among SFT/IT datasets).
- Modality Diversity: VisualWebInstruct expands SFT coverage to multimodal visual–text input.
- Pipeline Transparency: Documented multi-stage extraction/refinement, domain filtering, and decontamination protocols (Yue et al., 2024).
6. Applications and Limitations
6.1 Major Use Cases
- Training LLMs for high-fidelity instruction following across STEM, business, and humanities.
- Critique fine-tuning to improve error identification and correction (WebInstruct-CFT), facilitating LMs that “think about” as well as “produce” answers (Wang et al., 29 Jan 2025).
- Multimodal instruction-tuning for visual reasoning and solution explanation (VisualWebInstruct) (Jia et al., 13 Mar 2025).
- Data-efficient domain adaptation and specialization for mathematical, scientific, and technical sub-domains (Jiang et al., 22 Apr 2025).
6.2 Known Limitations
- Minor errors are present in an estimated ≈20% of the model-generated critiques in WebInstruct-CFT.
- Critiques are not human-verified; all annotations are synthesized by GPT-4o.
- Non-STEM domains (e.g., humanities, social sciences) are under-represented; downstream performance is untested in these areas.
- For CFT variants, inference does not involve self-critique; CFT is used only for training supervision (Wang et al., 29 Jan 2025).
- VisualWebInstruct leverages black-box Google Lens image similarity, with no open scoring function (Jia et al., 13 Mar 2025).
7. Code, Data Availability, and Reproducibility
- MAmmoTH2 WebInstruct and pipelines: https://tiger-ai-lab.github.io/MAmmoTH2/
- WebR datasets, prompt templates, and code: https://github.com/YJiangcm/WebR
- The accompanying papers describe the filtering and release protocols intended to support reproducibility (Yue et al., 2024, Jiang et al., 22 Apr 2025).
WebInstruct, across its variants, represents a new standard for large-scale, natural, and critique-supplemented instruction-tuning data, enabling LLM research to decouple reasoning improvement from scale, human annotation, or heavy reliance on LLM-generated synthetic seed sets.