
How2Mine: Web-Scale Procedural Extraction

Updated 10 February 2026
  • How2Mine is an automated pipeline that extracts structured, goal-conditioned procedures from diverse web documents.
  • It employs multi-stage processing—including stratified sampling, LLM extraction, heuristic and semantic filtering—to ensure high-quality procedural output.
  • The validated dataset is divided into How2Train and How2Bench, providing scalable resources for LLM training and realistic procedural benchmarking.

How2Mine is the web-scale pipeline for mining goal-conditioned, imperative “how-to” procedures as detailed in the How2Everything framework. Its main function is to extract structured, high-quality procedural knowledge from large-scale web data, yielding resources that enable scalable evaluation and improvement of LLMs on realistic planning and procedural execution benchmarks (Chang et al., 9 Feb 2026).

1. Definition and Scope

How2Mine is an automated pipeline designed to extract realistic, goal-conditioned procedures from arbitrary web documents at scale. Each output consists of a clear goal description and an ordered, concrete, imperative step list representing a valid, deterministic plan to achieve the stated goal. The system targets web corpora with broad topical coverage while enforcing entity neutrality and procedural robustness through a sequence of automated filtering and validation stages.

2. Pipeline Structure and Sequential Processing

The How2Mine pipeline operates over web-scale inputs in a precise, multi-stage fashion:

  1. Topical Stratified Sampling. Pages are classified by both format and topic using WebOrganizer’s classifiers, restricting the corpus to “Tutorial” and “How-to Guide” formats and enforcing equal sampling across 14 broad topics (e.g., Health, Electronics, Crime & Law). Sampling yields ≈70,000 pages per topic, totaling 980,000 source web pages.
  2. Procedure Extraction via LLM. Each sampled document is processed with GPT-4.1, which is prompted to detect and extract any sequential procedure present. Outputs include a goal string and the corresponding stepwise, atomic, imperative list. The yield at this stage is 860,044 procedures (87.8% of input pages).
  3. Heuristics Filtering
    • Step-Count Filter: Procedures must have between 5 and 15 steps.
    • N-gram Repetition Filter: Procedures are rejected if their step lists exhibit excessive intra-procedural repetition, formalized as:

    $$r_n = \frac{\sum_{g \in G_n} \max(0,\, c_g - 1)}{\sum_{g \in G_n} c_g}$$

    where $G_n$ is the set of all $n$-grams in the step list and $c_g$ is the count of $n$-gram $g$; a procedure is rejected if $r_2 \geq 0.40$, $r_3 \geq 0.35$, or $r_4 \geq 0.30$.

Post-filter, 705,952 candidates remain (72.0% of input).
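Since every observed $n$-gram satisfies $c_g \geq 1$, the numerator of the repetition ratio simply counts surplus (repeated) occurrences. A minimal sketch of this filter follows; the whitespace tokenization and lowercasing are assumptions, as the source does not specify tokenization details:

```python
from collections import Counter

def ngram_repetition(steps, n):
    """Fraction of n-gram occurrences that are repeats (r_n in the text)."""
    tokens = " ".join(steps).lower().split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return 0.0
    repeated = sum(count - 1 for count in grams.values())  # max(0, c_g - 1) with c_g >= 1
    return repeated / total

def too_much_repetition(steps, thresholds=((2, 0.40), (3, 0.35), (4, 0.30))):
    """Reject if any r_n meets its threshold (r2 >= 0.40, r3 >= 0.35, r4 >= 0.30)."""
    return any(ngram_repetition(steps, n) >= t for n, t in thresholds)
```

A templated procedure that repeats the same step verbatim is rejected, while a short list of distinct imperative steps passes.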

  4. LLM-Based Semantic Filtering. Each candidate undergoes a further LLM (GPT-4.1) pass to eliminate examples that:
    • Focus on named entities (brands, specific software, or persons)
    • Consist solely of calculation/mathematical procedures
    • Require UI/element names
    • Are open-ended, creative, or listicle-style
    • Exhibit internal inconsistency or reference errors

After this filter, 418,090 candidates are retained (42.7% of input).
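The exclusion criteria above can be encoded in a judge prompt. The helper below is hypothetical (the source does not reproduce the actual prompt); it sketches one way to request a binary keep/reject verdict from a filtering model:

```python
# Rejection criteria from the semantic-filtering stage, paraphrased.
REJECTION_CRITERIA = [
    "focuses on named entities (brands, specific software, or persons)",
    "consists solely of calculation or mathematical procedures",
    "requires specific UI element names",
    "is open-ended, creative, or listicle-style",
    "exhibits internal inconsistency or reference errors",
]

def build_semantic_filter_prompt(goal, steps):
    """Return a judge prompt asking for a one-word KEEP/REJECT verdict."""
    criteria = "\n".join(f"- {c}" for c in REJECTION_CRITERIA)
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return (
        "You are reviewing a candidate how-to procedure.\n"
        f"Goal: {goal}\n"
        f"Steps:\n{numbered}\n\n"
        "Reject the procedure if it matches ANY of these criteria:\n"
        f"{criteria}\n\n"
        "Answer with exactly one word: KEEP or REJECT."
    )
```

The returned string would be sent as the user message of a chat-completion call; parsing the single-word verdict keeps the filter cheap to run at scale.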

  5. Post-processing and Resource Extraction
    • Goal Rewriting: Ensures specificity and removes ambiguity, as well as extraneous branching or resource-gathering clauses.
    • Resource Extraction: Prompts the LLM for deduplicated external tools, ingredients, or equipment named in steps.
    • Single-Action Enforcement: Each step is pruned to one concise imperative sentence.
  6. Final Sanity-Check Filtering. A final LLM filter ensures that surviving candidates satisfy: plausibility of goal achievement, strict sequentiality, entity neutrality, and alignment between goals, resources, and steps. Survivors number 351,162 (35.8% of input) and constitute the validated set.

3. Dataset Characteristics and Output Structure

How2Mine generates 351,162 validated how-to procedures from 980,000 web pages sampled across 14 topics, with coverage balanced by design. The resulting dataset spans 189,000 unique domains. The outputs serve as two downstream resources:

  • How2Train: 344,162 procedures for model training and reinforcement learning
  • How2Bench: 7,000 evaluated references (500 per topic) for benchmarking

Each procedure is structured as a triplet: (goal, resource-list, ordered step list).
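In code, one such record could be represented as a simple immutable dataclass; the field names and the example procedure here are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Procedure:
    goal: str                   # rewritten, entity-neutral goal
    resources: tuple[str, ...]  # deduplicated tools, ingredients, or equipment
    steps: tuple[str, ...]      # ordered, single-action imperative steps

example = Procedure(
    goal="Repot a root-bound houseplant",
    resources=("new pot", "potting soil", "trowel"),
    steps=(
        "Water the plant lightly the day before repotting.",
        "Tip the pot sideways and slide the root ball out.",
        "Loosen the outer roots with your fingers.",
        "Place the plant in the new pot and backfill with soil.",
        "Water thoroughly and move the pot out of direct sun.",
    ),
)
```

Freezing the dataclass keeps validated records immutable downstream, and tuples make the step order explicit in the type.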

| Stage | Candidates | Yield (%) | Notes |
|---|---|---|---|
| Extraction | 860,044 | 87.8 | LLM identifies goal + steps in sampled docs |
| Heuristics filter | 705,952 | 72.0 | Step-count, n-gram repetition thresholds |
| LLM semantic filter | 418,090 | 42.7 | Removes entity-guided, listicle, creative procedures |
| Final sanity check | 351,162 | 35.8 | Sequentiality, correctness, alignment |

4. Algorithmic Details and Formal Criteria

Each pipeline stage enforces rigorous acceptance criteria:

  • Stratified Sampling: Uniform topic coverage via format/topic classifiers
  • Step Filtering:
    • $5 \leq |\text{steps}| \leq 15$
    • N-gram repetition as above, to filter templates, menus, or low-diversity instructions
  • Semantic Filtering: LLMs enforce domain transferability and genericity
  • Resource and Goal Processing: LLM-based rewriting for specificity and determinism
  • Pseudocode (as provided):

function How2Mine(docs[1..N]):
    sampled = stratified_sample_by_topic_and_format(docs, topics=14, format='Tutorial & How-to')
    candidates = []
    parallel for doc in sampled:
        out = LLM_extract_goal_and_steps(doc)
        if not out.has_valid_process: continue
        if not (5 <= len(out.steps) <= 15): continue
        if too_much_repetition(out.steps): continue
        if not LLM_filter_semantic(out.goal, out.steps): continue
        (g2, steps2) = LLM_rewrite_goal_and_steps(out.goal, out.steps)
        resources = LLM_extract_resources(g2, steps2)
        if not LLM_final_check(g2, resources, steps2): continue
        candidates.append((g2, resources, steps2))
    return candidates  # ~351K records
The too_much_repetition routine enforces all n-gram repetition thresholds as described.

5. Evaluation Metrics and Quality Assurance

Quality validation employs both automated and manual metrics:

  • Final reference validity: 96.6% (as judged by GPT-4.1 on a held-out reference set)
  • Manual annotation: Krippendorff’s $\alpha = 0.593$ for binary “has_failure/no_failure” (200-example subset, How2Score definitions)
  • LLM judge agreement: Five LLMs achieve 76.5–83.0% agreement with the human majority on semantic filtering
  • How2Judge distillation: 80.5% agreement with humans, 90.5% with GPT-5 as teacher

These results demonstrate high empirical validity of mined procedures and the effectiveness of multi-stage, LLM-centric filtering (Chang et al., 9 Feb 2026).

6. Scalability, Implementation, and Extensibility

The pipeline is embarrassingly parallel: each document is processed independently, and all LLM calls are batchable via the OpenAI API (252,000 GPT-4.1 invocations over the sample set at a reported cost of USD 5,717). All filtering/post-processing steps are suitable for distributed dataflow frameworks (e.g., Spark, Ray).
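Because each document is processed independently, the per-document chain parallelizes trivially. A minimal sketch using Python's standard library, where the `extract` callable stands in for the LLM extraction-and-filtering chain (a hypothetical interface, not the paper's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def mine_parallel(docs, extract, max_workers=8):
    """Apply a per-document extraction function concurrently.

    `extract` should return a validated record, or None for documents
    that yield no procedure or fail a filter; Nones are dropped.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(extract, docs)  # preserves input order
    return [record for record in results if record is not None]
```

Threads suit I/O-bound API calls; the same shape maps directly onto Spark or Ray tasks for larger corpora.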

Extension to new domains or corpus sources (e.g., blogs, Q&A forums) is feasible with only moderate reductions in procedural yield, per the reported performance in alternative source-types. The pipeline’s algorithmic modularity supports efficient scaling to substantially larger or more diverse text corpora.

7. Comparative Perspective and Relation to Prior Work

How2Mine’s approach differs from earlier procedural mining paradigms in its LLM-centric, web-scale filtering and goal-conditioned framing. For comparison, Gupta et al. mine technical troubleshooting procedures from HTML DOM trees using a combination of tree traversal, slot-grammar parsing, and supervised list classification, with decision-aware graph construction. This earlier method focuses on domain-specific, decision-rich technical content, annotating procedure/step/decision boundaries with manually curated features and heuristics (Gupta et al., 2018).

How2Mine instead leverages LLM-based extraction for more heterogeneous, open-domain web corpora, calibrating procedural validity and generalization via semantic and structural constraints, repetition metrics, and LLM-based cross-checking, resulting in orders-of-magnitude broader domain coverage and direct applicability to LLM training and benchmarking (Chang et al., 9 Feb 2026).
