
How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

Published 9 Feb 2026 in cs.LG | (2602.08808v1)

Abstract: Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.

Summary

  • The paper introduces a web-scale framework (How2Everything) that mines, benchmarks, and optimizes procedural generation in LLMs.
  • It utilizes multi-stage filtering and RL-based optimization with LLM-as-a-judge to ensure procedural validity and detect critical failures.
  • Results indicate measurable improvements across model scales and diverse domains, enhancing real-world applicability of step-by-step instructions.

Systematic Evaluation and Improvement of Procedural Generation in LLMs with How2Everything

Motivation and Problem Setting

The production of reliable step-by-step procedures is a critical LLM capability, tightly coupled to real-world usability in domains ranging from troubleshooting and education to practical planning. Procedure generation poses distinctive evaluation challenges: validity demands correct handling of multi-step dependencies, broad domain coverage, execution feasibility, and avoidance of critical errors. Prior procedural-generation datasets and benchmarks suffer from constrained topical scope (often limited to domains like cooking) and unreliable evaluation metrics (string overlap or single-reference likelihood), limiting robust capability analysis and scalable optimization.

The How2Everything Framework

How2Everything operationalizes a web-scale pipeline to mine, evaluate, and optimize goal-conditioned procedural knowledge:

  • How2MINE: Large-scale extraction of procedures from 980K web documents, stratified across 14 diverse topics, yielding a pool of 351K high-quality procedure instances. Multi-stage filtering (LLM-based and heuristic) removes entity-specific, non-sequential, trivial, or nonsensical procedures and standardizes the format for evaluation and training.
  • How2BENCH: Evaluation corpus constructed with balanced topical coverage (7K examples, 500/topic) from How2MINE, designed for benchmarking procedural validity across scales, from small (1B) to frontier models.
  • How2TRAIN: Remaining mined examples serve as a heterogeneous, large-scale procedural training resource.
  • How2SCORE and How2JUDGE: Critical-failure detection protocol leveraging LLM-as-a-judge, formalizing evaluation as binary classification of "has critical failure" vs. "no failure," anchored to a reference procedure. How2JUDGE, an open 8B model distilled from GPT-5 annotations, reaches 80.5% agreement with human annotators, enabling reproducible, cost-effective evaluation and RL reward signals.
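The How2SCORE protocol can be sketched as a prompt-and-parse pair: the judge sees the goal, the reference procedure, and the candidate, and returns a binary verdict. The prompt wording, field layout, and answer strings below are illustrative assumptions, not the paper's actual templates:

```python
# Sketch of the How2SCORE judging step: build a judge prompt anchored to the
# reference procedure, then map the judge's reply to a binary verdict.
# JUDGE_PROMPT and the answer strings are hypothetical, not the paper's.

JUDGE_PROMPT = """\
Goal: {goal}

Reference procedure:
{reference}

Candidate procedure:
{candidate}

Does the candidate contain any critical failure (omission, contradiction,
severe vagueness, or unsafe/extraneous action) that would prevent achieving
the goal? Answer exactly "has_critical_failure" or "no_failure"."""


def build_judge_prompt(goal, reference, candidate):
    def fmt(steps):
        return "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return JUDGE_PROMPT.format(goal=goal, reference=fmt(reference),
                               candidate=fmt(candidate))


def parse_verdict(judge_output):
    """True iff the judge reports no critical failure."""
    return "no_failure" in judge_output.strip().lower()
```

In practice `parse_verdict` would be applied to the output of an actual judge model call; here it only illustrates the binary aggregation.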

Evaluation Methodology

Procedural validity is defined via detection of omissions, contradictions, severe vagueness, and extraneous actions that preclude goal achievement or render instructions unusable, with reference procedures serving as evaluation anchors. Human validation of critical-failure detection achieves moderate inter-annotator agreement (α = 0.593), supporting its use as a reliable RL reward and ranking metric. LLM judges exhibit high agreement with human labels (up to 83%) and minimal self-preference bias across model families, supporting cross-family evaluations and performance comparisons.

Benchmarking Procedural Capability

How2BENCH demonstrates strong scaling properties:

  • Performance correlates with model size and training progression, with measurable gains emerging early in pretraining (at roughly 5% of pretraining FLOPs for 1B models) and continuing through subsequent stages.
  • Score distributions by reference step count (length) reveal monotonic increases in difficulty with more required steps. Logistic regression analyses confirm reference step count as the dominant predictor of failure rates, superseding resource count or residual verbosity.
  • Benchmark ordering is not reducible to reference-step perplexity; the Spearman rank correlation between How2SCORE and perplexity weakens in larger models, indicating that procedural validity is distinct from reference-likelihood optimization.
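The perplexity-decoupling analysis reduces to a rank correlation between two per-model orderings. A minimal pure-Python Spearman implementation as a sketch (the per-model score and perplexity values would come from actual benchmark runs; none are reproduced here):

```python
# Spearman rank correlation between two model orderings, e.g. benchmark
# success rates vs. reference perplexities across a set of models.

def rank(values):
    """Ranks starting at 1, with average ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A weakening correlation here would indicate, as the paper reports, that validity rankings and likelihood rankings diverge as models grow.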

Optimization via RL

How2Everything enables closed-loop optimization with scalable, low-cost RL:

  • RL using How2SCORE as the reward, evaluated by How2JUDGE, produces consistent improvements: e.g., Qwen3-8B-Instruct improves by +10.1 points on How2BENCH post-RL with no systematic out-of-domain regressions, including robustness to external judges.
  • Length-control during RL prevents gaming through verbosity, keeping average generation-token ratios tightly bounded.
  • RL gains depend on pretraining maturity and broad topical training coverage, confirming that improvements are not driven solely by format imitation or data contamination.
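The reward described above can be sketched as a binary judge verdict combined with a verbosity penalty; the weighting, the ratio bound, and the format gate below are illustrative choices, not the paper's reported hyperparameters:

```python
# Sketch of a How2SCORE-based RL reward with length control. The 1.5 ratio
# bound and the penalty shape are hypothetical; the paper reports combining
# the judge verdict with length/format controls to prevent gaming.

def procedure_reward(no_failure, gen_tokens, ref_tokens,
                     format_ok=True, max_ratio=1.5):
    if not format_ok:
        return 0.0                       # malformed output earns nothing
    reward = 1.0 if no_failure else 0.0
    ratio = gen_tokens / max(ref_tokens, 1)
    if ratio > max_ratio:                # penalize verbosity beyond the bound
        reward -= min(1.0, ratio - max_ratio)
    return max(reward, 0.0)
```

In a GRPO-style loop this scalar would be attached to each sampled generation; the length term keeps the generation-to-reference token ratio bounded, as the section above notes.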

Robustness to Confounds

Targeted analyses invalidate alternative explanations for observed improvements:

  • Format Compliance: Gains increase substantially only with strong pretraining checkpoints and sufficient topic diversity, indicating content-level (not format-level) enhancement.
  • Source Memorization: Aggressive document contamination during midtraining produces sharp drops in perplexity, but How2SCORE increases only modestly (+3 points for 7B, +6.1 for 32B models), ruling out memorization as the dominant factor.
  • Residual Verbosity Bias: A positive association between the generation-to-reference token ratio and no_failure odds persists even under explicit length control, motivating the reporting of average token counts alongside all scores.
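A sketch of that reporting convention, using whitespace tokenization as a simplifying assumption (an actual run would use the model tokenizer):

```python
# Average generation-to-reference token ratio, reported alongside scores to
# expose residual verbosity bias. Whitespace splitting stands in for a real
# tokenizer here.

def avg_token_ratio(pairs):
    """pairs: list of (generation, reference) strings."""
    ratios = [len(g.split()) / max(len(r.split()), 1) for g, r in pairs]
    return sum(ratios) / len(ratios)
```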

Qualitative Diagnostics

Large-scale qualitative analysis reveals canonical failure patterns: critical omissions of prerequisite actions, vague or underspecified operational parameters, unsafe actions, internal contradictions, and refusals. Successful generations exhibit actionable precision, correct sequencing, and minimal critical deviations, even when phrasing diverges from the reference.

Practical and Theoretical Implications

How2Everything establishes a closed-loop paradigm for procedural capability development: web-mined data enables rigorous evaluation and optimization, bypassing the need for expensive human annotation or unreliable overlap-based metrics. The framework scales practically to model selection, ranking, and forecasting of downstream performance. Procedural skill emerges as a core, generalizable capability (observable early in training), making it a promising auxiliary RL signal with tangible effects across unrelated domains.

Potential risks include the propagation of societal biases or unsafe procedural content from web-mined data, and the impact of applying LLM-evaluated metrics as proxies for real-world correctness in safety-critical applications. Mitigation strategies include topic-level safety filtering and explicit documentation of evaluation limitations. Release of evaluation splits, prompt templates, and judge model weights supports transparent, reproducible benchmarking.

Conclusion

How2Everything addresses the longstanding challenges of procedural generation evaluation and optimization in LLMs through web-scale corpus mining, validity-oriented LLM-as-a-judge protocols, and systematic RL-based improvement. The framework unlocks scalable analysis and enhancement of goal-conditioned procedure generation, facilitating domain-agnostic benchmarks, robust training signals, and practical advances for user-facing LLM applications (2602.08808).

Explain it Like I'm 14

What is this paper about?

This paper is about teaching and testing AI chatbots on “how-to” tasks—step-by-step instructions for doing something, like cooking a recipe, fixing a gadget, or handling a legal form. The authors build a big, reusable system called “How2Everything” to collect real-world how-to procedures from the web, judge whether AI-generated steps would actually work, and then use that judgment to help AIs get better at writing instructions.

What questions are the researchers trying to answer?

Here are the main questions, explained simply:

  • How can we gather lots of diverse, real-world step-by-step guides from the web to use for training and testing AI?
  • How can we judge whether an AI’s step-by-step instructions are actually usable and would let someone achieve the goal?
  • Do bigger and better-trained AIs write more reliable how-to instructions?
  • Can we use this judging system as feedback to improve AI instruction writing without breaking other skills (like math, coding, or general knowledge)?
  • Are improvements real, or are AIs just memorizing web pages or learning a “nice format” without truly better content?

How did they do it?

The authors built a framework with four main parts. Think of it like building a giant “how-to” library, a fair referee, a test set, and a training program.

1) Collecting step-by-step guides from the web (How2Mine)

They scan about 980,000 web pages that look like tutorials and extract clean, structured procedures across 14 topics (like Food & Dining, Health, Electronics, Science & Tech, and more). After filtering and cleaning, they end up with 351,000 usable procedures.

To make each example clear and testable, they include:

  • The goal (what you’re trying to do)
  • A list of resources (tools or ingredients mentioned in the steps)
  • The reference steps (the original step-by-step instructions from the web page)

From this big pool:

  • They build an evaluation set called How2Bench (7,000 examples; 500 per topic)
  • They create a training set called How2Train (the remaining 300k+ examples)
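The split above can be sketched as simple stratified sampling: up to 500 examples per topic go to the benchmark, the rest to training. The `topic` field name and seed handling are assumptions about the mined record format:

```python
# Topic-balanced benchmark/training split over mined procedure records.
# Each record is assumed to be a dict with a "topic" key.
import random
from collections import defaultdict

def split_bench_train(examples, per_topic=500, seed=0):
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for ex in examples:
        by_topic[ex["topic"]].append(ex)
    bench, train = [], []
    for topic, pool in by_topic.items():
        rng.shuffle(pool)            # randomize before taking the eval slice
        bench.extend(pool[:per_topic])
        train.extend(pool[per_topic:])
    return bench, train
```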

2) Judging AI-written procedures for “critical failures” (How2Score + How2Judge)

They define a “critical failure” as a mistake that would stop you from reaching the goal or make the instructions unusable. Examples:

  • Skipping a required waiting period in a legal process
  • Leaving out key parameters like cooking time or temperature, so you can’t actually finish the recipe
  • Doing unsafe or contradictory steps

They use a strong AI judge (a frontier model) to check AI outputs for these failures and then “distill” its judging skills into a smaller, open model called How2Judge (8 billion parameters). Distilling means training a smaller model to mimic a bigger one so it’s cheaper and easier to use. How2Judge agrees with human annotators about 80.5% of the time, which is good enough for large-scale testing and training.

3) Testing lots of AIs (How2Bench)

They ask many different AI models to write procedures for the same goals and score them with the judge. Models get a “success rate” based on how often their instructions contain no critical failures. They control the number of steps (so every model outputs the same count) to keep comparisons fair.
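The “success rate” above is just the fraction of generations judged to have no critical failure. A sketch with a per-topic breakdown added (the `(topic, no_failure)` pair format is an assumption about how verdicts are stored):

```python
# Aggregate per-example judge verdicts into an overall success rate and a
# per-topic breakdown, both as percentages.
from collections import defaultdict

def success_rates(results):
    """results: list of (topic, no_failure) pairs."""
    overall = 100.0 * sum(ok for _, ok in results) / len(results)
    per_topic = defaultdict(list)
    for topic, ok in results:
        per_topic[topic].append(ok)
    by_topic = {t: 100.0 * sum(v) / len(v) for t, v in per_topic.items()}
    return overall, by_topic
```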

4) Improving AIs using the judge’s feedback (Reinforcement Learning)

They use the judge’s score as a reward signal in reinforcement learning (RL), a training method where the AI tries things, gets feedback, and adjusts. They add simple format checks and length control so models don’t “game” the score by being overly long or flowery. The result: models get better at writing usable how-to procedures.

What did they find, and why does it matter?

Main findings

  • Scale and diversity: They mined 351,000 real-world procedures across 14 topics from 980,000 web pages. That’s a huge, diverse dataset.
  • Reliable judging at low cost: Their small judge (How2Judge) matches human labels about 80.5% of the time, making it practical to use at scale.
  • Clear performance trends: On How2Bench, bigger and better-trained models make fewer critical mistakes. The benchmark shows smooth “scaling” behavior and can track improvements even early in training.
  • RL makes models better at how-to writing: Using the judge’s feedback, models improved their How2Bench scores by more than 10 points across three different models.
  • No broad harm to other skills: After RL, models did not consistently get worse on 12 external benchmarks covering math, coding, knowledge, and reasoning. Some scores even improved.
  • Improvements aren’t just formatting or memorization: Gains weren’t mainly due to learning a neat style or repeating training data. For example, even when source documents were shown multiple times during training, the how-to scores only rose modestly.

Why it matters

  • How-to guidance is a common use for chatbots. Making this skill measurable—and trainable—helps build assistants that give safer, more reliable instructions.
  • The judge focuses on “would this actually work?” rather than word overlap or surface niceness. That means better real-world usefulness.
  • The approach creates a “closed loop”: collect web data → evaluate models reliably → use the evaluation to improve models → test again. This loop can push steady progress.

What methods or terms might feel technical?

Here are a few, explained with simple analogies:

  • LLM: A very advanced text-predicting AI. Think of it as a super smart autocomplete that can write full answers.
  • Benchmark: A standardized test set. Like a consistent quiz used to compare different students (models).
  • Critical failure: A serious mistake that would prevent success. Imagine a recipe that never tells you to bake the cake—no matter how pretty the instructions look, you won’t get a cake.
  • Distillation: Teaching a smaller AI to judge like a bigger, more expensive AI. Like a trainee learning from a master so the trainee can handle day-to-day work cheaply.
  • Reinforcement Learning (RL): Training with feedback signals. Think of a coach giving a score after each try; the player (the model) adjusts to get higher scores next time.

What could this change in the real world?

  • More dependable AI assistants: Better, step-by-step instructions for everyday tasks (cooking, DIY, travel), troubleshooting (electronics, home repairs), and planning (study routines, fitness).
  • Safer, clearer guidance: The judge penalizes missing steps, contradictions, and unsafe actions, nudging models toward practical instructions.
  • A reusable process: Other teams can use the data, the judge, and the protocol to build and improve how-to capabilities in their own models.

Final takeaway

The authors show that collecting huge numbers of real-world how-to procedures and judging them for “will this actually work?” can both test and improve AI models at scale. Their system helps AIs write better step-by-step instructions without harming other skills, and it moves us toward chatbots that are more useful and trustworthy for everyday tasks.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored, framed to be actionable for future work:

  • Executable validation gap: No correlation study between How2SCORE and actual task success under execution (simulation, robotic/GUI environments, or expert enactment); design execution-grounded subsets to calibrate the judge and metric.
  • Limited human validation: Only 200 examples with non-expert annotators and moderate agreement (α≈0.59); expand to larger, topic-stratified human (and domain-expert) panels with confidence/rationale capture and report CIs.
  • Reference anchoring risk: References may contain errors and are treated as anchors; quantify how often valid alternative procedures are wrongly penalized and introduce multi-reference/equivalence-class judging.
  • Binary metric coarseness: A single “any critical failure” label loses severity and partial-credit signal; develop severity-weighted scores, per-step localization, and PRM-style labels for procedural tasks.
  • Judge generalization and bias: How2JUDGE is distilled from a single teacher and trained on 73K labels; stress-test cross-domain and cross-family biases, style/verbosity sensitivity beyond length, and robustness to adversarial generations that target judge weaknesses.
  • External validation breadth: Gains were checked on a small set of external judges; broaden to judge ensembles (including open-source), mixed human+LLM meta-judging, and report stability across evaluators.
  • Reward hacking beyond length: Length control is addressed, but other gaming avenues (e.g., over-hedging, generic safety disclaimers, superficial parameter stuffing) are untested; build targeted adversarial evaluations and reward guards.
  • Evaluation controls vs natural use: Models are given resource lists and exact step counts, which users typically don’t provide; measure performance without R and n, add a step-count estimation task, and evaluate robustness to omitted/incorrect resources.
  • Coverage exclusions: Pipeline filters out UI-based procedures, named-entity–specific tasks, and <5 or >15 step tasks; extend mining to digital/GUI tasks, brand/jurisdiction-specific instructions, and very short/long or hierarchical procedures.
  • Long-horizon and branching structures: Current evaluation is linear; introduce tasks with conditional branches, loops, and interdependencies, and judge correctness of conditionality (e.g., if/else prerequisites).
  • Safety characterization: No targeted analysis of safety-critical failure modes (health/legal/industrial); create safety-graded subsets, risk-weight critical failures, and measure whether RL increases unsafe assertiveness or suppresses prudent warnings.
  • Domain expertise: Human validation avoided specialized topics, yet benchmark includes them; build expert-annotated test slices (health/legal/industrial) and compare expert vs generalist judgments and judges’ calibration.
  • Memorization/confounding: Memorization tests focus on midtraining; extend to pretraining, publish document URL/hash overlap stats, and perform per-example contamination audits between model corpora and How2TRAIN/BENCH.
  • Topic and cultural bias: The 14-topic, English, web-mined data may be geographically/culturally skewed; audit domain/geography distributions, add multilingual procedures, and test jurisdiction-specific legal/health variability.
  • Data licensing and privacy: Web-mined content raises licensing/PII questions; detail licenses, PII filtering efficacy, and provide a replicable, legally-compliant release (URLs/hashes, removal requests).
  • Mining reproducibility: Mining relies on closed APIs (GPT-4.1) for multiple stages; provide open-model extractors/filters, report accuracy/cost trade-offs, and quantify delta in dataset quality vs closed-API versions.
  • Judge training transparency: Release (or summarize) the 73K judge-training annotations (with decontamination proofs) and ablate teacher choices to quantify teacher bias imprinting on the distilled judge.
  • OOD impact breadth: OOD assessments cover 12 benchmarks but omit alignment metrics (toxicity, helpfulness, factuality, calibration); add these and perform user studies on helpfulness and trust post-RL.
  • RL stability and design: Only GRPO and one reward mix were tested; run ablations over algorithms, reward weights, seeds, and curriculum schedules; assess catastrophic forgetting and long-run stability.
  • Tool-augmented agents: No evaluation of retrieval/tool use for reducing critical failures; benchmark agentic variants (retrieval, calculators, checklists) and measure judge/human-validated gains.
  • Difficulty modeling: Step count correlates with difficulty, but other factors (parameter density, prerequisite chaining, domain hazard level) are unmodeled; create calibrated difficulty strata and report per-stratum performance.
  • Resource extraction fidelity: Resource lists are derived from references and may be incomplete; audit extraction errors and modify judging to accept justified additions of missing but necessary resources.
  • Per-topic diagnostics: Beyond aggregate scores, publish per-topic failure taxonomies and ablate topic-balanced vs real-world frequency-weighted evaluation to understand transfer and deployment relevance.
  • Naturalness and usability: Evaluation enforces one concise sentence per step; measure effects of more natural, multi-sentence steps, inline rationales, and formatting variants on judge agreement and human usability.
  • Alternative-validity stress tests: Construct “diversity” sets where multiple distinct procedures are valid; measure judge tolerance to procedural diversity and develop prompts/training to reduce over-anchoring to references.
  • Multimodal procedures: Many how-tos are image/video-first; extend mining and evaluation to multimodal inputs/outputs and test whether multimodality reduces critical failures.
  • Scaling laws and forecasting: Formalize performance–compute/token scaling fits with uncertainty bands, compare to perplexity-based predictors, and test cross-run generalization beyond OLMo trajectories.

Practical Applications

Below, the paper’s contributions (How2MINE, How2BENCH, How2SCORE, How2JUDGE, and RL with How2SCORE) are translated into practical applications with sectors, deployability, and feasibility notes.

Immediate Applications

  • CI-grade benchmark for procedural quality in LLM training and model selection
    • Sector: software/AI platforms and model labs
    • Application: Integrate How2BENCH + How2SCORE (with the open 8B How2JUDGE) into continuous evaluation to track scaling trends, catch regressions, and gate releases; slice difficulty by step count to diagnose failure modes early in pretraining/midtraining.
    • Tools/workflows: Add a “procedural validity” job to eval harnesses; nightly dashboards; per-topic scorecards across 14 topics.
    • Assumptions/dependencies: Access to How2BENCH and How2JUDGE; acceptance that 80.5% human agreement is sufficient for gating; enforce length/format controls to avoid verbosity gaming.
  • RL-based improvement of how-to generation without broad OOD regressions
    • Sector: software/AI training
    • Application: Use How2SCORE (scored by How2JUDGE) as a reward in GRPO/PPO-style RL to boost end-to-end procedural validity (+10 points reported) while maintaining standard benchmark performance.
    • Tools/workflows: Extend post-training pipelines (RLHF/RLAIF) with an additional How2SCORE reward head; include a length-control reward and a format verifier to prevent gaming.
    • Assumptions/dependencies: Compute budget for RL; careful reward design (length control, format checks); safety review of reward-induced behaviors.
  • Low-cost LLM judge for production guardrails on how-to responses
    • Sector: consumer assistants, enterprise copilots, helpdesk chat
    • Application: Run-time screening of generated procedures for “critical failures” before display; auto-regenerate until no_failure or escalate to human.
    • Tools/workflows: On-call guardrail microservice using How2JUDGE; retry-and-revise loops with targeted feedback from detected failure types (omission, unsafe step, contradiction).
    • Assumptions/dependencies: Latency/SLA budgets for a second-pass judge; judge bias and 80.5% agreement imply residual risk—should be paired with safety policies.
  • SOP and runbook quality checking (“procedure linting”)
    • Sector: DevOps/IT, manufacturing, healthcare operations, customer support
    • Application: Batch-scan standard operating procedures, knowledge-base articles, and runbooks to flag critical omissions or contradictions; improve handoffs and on-call playbooks.
    • Tools/workflows: CMS/Confluence/GitHub Actions plugin (“How2 Lint”); pull-request checks that annotate suspect steps.
    • Assumptions/dependencies: References are used as anchors; not a substitute for domain-expert validation in safety-critical contexts.
  • Content moderation and safety gate for user-posted how-to material
    • Sector: social/content platforms, marketplaces
    • Application: Pre-publication screening for unsafe or fatally flawed instructions; triage to human moderators.
    • Tools/workflows: Queue-based moderation with topic-aware thresholds; per-topic safety policies (e.g., health, legal).
    • Assumptions/dependencies: Additional hazard classifiers are recommended; the judge targets “critical failure,” not comprehensive safety.
  • Vendor benchmarking and procurement due diligence for AI buyers
    • Sector: public sector, regulated industries, enterprise AI procurement
    • Application: Compare closed/open models with a reproducible, low-cost procedural validity score; request difficulty-controlled breakdowns (by step count, topic).
    • Tools/workflows: RFPs include How2BENCH scores and per-topic reports; third-party audit scripts using How2JUDGE.
    • Assumptions/dependencies: Standardized prompts and length controls; public documentation of limitations to avoid over-claiming “real-world correctness.”
  • Educational content authoring and review
    • Sector: education, workforce training
    • Application: Generate and validate stepwise lesson plans, lab protocols, and practice tasks; ensure key parameters (times, quantities) are present.
    • Tools/workflows: LMS plugins to draft procedures and run a “critical failure check” before publishing to students.
    • Assumptions/dependencies: Domain calibration; non-executable judging means expert review is still needed for labs and high-risk activities.
  • Troubleshooting assistants with self-check-and-fix loops
    • Sector: IT support, electronics, smart-home support
    • Application: LLM generates a fix workflow; How2JUDGE flags missing preconditions or unsafe steps; the agent revises before presenting to the user.
    • Tools/workflows: Ticketing-system integration; “revise-on-failure” agent policy with topic-specific hints.
    • Assumptions/dependencies: UI-interaction steps were filtered out in the dataset—mapping to real UI actions requires extension; careful logging for accountability.
  • Data curation and augmentation for SFT/midtraining
    • Sector: AI training
    • Application: Use How2TRAIN to augment instruction tuning for planning-oriented skills; select topics where current models underperform.
    • Tools/workflows: Curriculum that increases step count and topic diversity over time; contamination checks for overlap with eval splits.
    • Assumptions/dependencies: Licensing and privacy compliance for mined web data; modest memorization risk per paper (+3 points observed under aggressive contamination).
  • Research instrumentation for scaling-laws, PRMs, and evaluation science
    • Sector: academia/AI evaluation
    • Application: Study emergence and decoupling of formatting vs validity; compare perplexity vs task-level validity; develop length-robust evaluators.
    • Tools/workflows: Open How2BENCH split; How2JUDGE as an affordable judge; cross-judge robustness analyses.
    • Assumptions/dependencies: Acceptance of binary aggregation (“any critical failure”) as the primary endpoint; replication across languages and domains pending.

Long-Term Applications

  • Execution-grounded procedure verification and robotics planning
    • Sector: robotics, smart manufacturing, household robotics
    • Application: Bridge descriptive procedures to executable plans via simulators or symbolic planners; use How2SCORE-like PRMs to detect catastrophic plan flaws pre-execution.
    • Tools/workflows: Text-to-plan translation; simulator-in-the-loop verification; PRMs aligned to execution traces.
    • Assumptions/dependencies: Reliable grounding and state estimation; domain simulators; safety cases for physical actuation.
  • Enterprise process automation and RPA co-pilots
    • Sector: finance operations, HR, IT, supply chain
    • Application: Convert validated stepwise procedures into RPA scripts; judge detects missing approvals or compliance steps before deployment.
    • Tools/workflows: Procedure-to-RPA compilers; compliance-aware judges; change-management pipelines.
    • Assumptions/dependencies: Current dataset excludes UI/GUI steps; requires extending mining to UI flows and integrating with enterprise systems; strong governance.
  • Sector-specialized judges for high-stakes domains
    • Sector: healthcare, legal, aviation, chemical safety
    • Application: Distill domain-specific How2JUDGE variants using expert-labeled data; incorporate stronger hazard taxonomies beyond “critical failure.”
    • Tools/workflows: Expert-in-the-loop distillation; test suites with adverse-event cases; hierarchical safety gating.
    • Assumptions/dependencies: Access to expert annotations; regulatory alignment; rigorous validation (the base judge is ~80.5% human agreement, insufficient alone for safety-critical use).
  • Standards and certification for AI procedural reliability
    • Sector: policy/regulation, industry consortia
    • Application: Establish certification tiers for procedural assistants (topic coverage, difficulty bands, failure rates); require reporting of judge calibration and cross-judge robustness.
    • Tools/workflows: Public benchmarks with versioning; auditing protocols; procurement templates referencing certified scores.
    • Assumptions/dependencies: Multi-stakeholder consensus on metrics; governance for updates and deprecation; mitigation of judge bias/self-preference.
  • Continual web-scale mining for living knowledge bases
    • Sector: knowledge platforms, documentation tooling
    • Application: Periodically refresh topic-balanced how-to corpora; detect drift and outdated steps; auto-suggest updates to manuals.
    • Tools/workflows: Scheduled How2MINE jobs; deduplication and link rot handling; change-impact analysis.
    • Assumptions/dependencies: Sustainable API/computation costs; robust safety/privacy filtering; publisher permissions.
  • Multilingual and low-resource expansion
    • Sector: global education, international support centers
    • Application: Mine and judge how-to procedures across languages; train multilingual judges; close performance gaps in non-English settings.
    • Tools/workflows: Cross-lingual topic classifiers; multilingual distillation of judges; alignment with local regulations/cultural norms.
    • Assumptions/dependencies: Availability of diverse web corpora; language-specific safety and legal considerations.
  • Authoring IDEs with real-time procedural linting and repair
    • Sector: technical writing, medical device documentation, aviation MRO
    • Application: As authors type procedures, the system flags likely critical failures and proposes patches; enforces step counts and parameter specificity.
    • Tools/workflows: Editor plugins; structured authoring schemas; change justification logs.
    • Assumptions/dependencies: Domain-specific templates and controlled vocabularies; human acceptance and traceability requirements.
  • Autonomous planning agents with self-verification
    • Sector: AI agents for research, operations, and daily life
    • Application: Agents generate multi-step plans, self-judge for critical failures, and iteratively refine before acting or delegating to tools/humans.
    • Tools/workflows: Plan-generate-judge-revise loops; hybrid PRMs combining textual and tool-execution signals.
    • Assumptions/dependencies: Robust integration with tool APIs; calibrated confidence thresholds to avoid infinite revision loops.
  • Emergency and disaster-response checklist generation
    • Sector: public safety, humanitarian logistics
    • Application: Rapidly compose context-specific checklists (evacuation, field triage) and auto-flag critical omissions; adapt across locales and constraints.
    • Tools/workflows: Incident-specific templates; human-in-the-loop validation workflows; multilingual dissemination.
    • Assumptions/dependencies: Formal endorsement by authorities; rigorous red teaming; offline capability and reliability.
  • Patient education and discharge instruction QC at scale
    • Sector: healthcare providers, payers
    • Application: Screen patient-facing instructions for missing parameters (doses, timings), contradictions, or unsafe steps; prioritize for clinician review.
    • Tools/workflows: EHR-integrated pipelines; clinician sign-off dashboards; audit trails.
    • Assumptions/dependencies: Regulatory compliance (HIPAA, GDPR); domain-specialized judges; clinician oversight; recognition that non-executable judging is an aid, not a guarantee.
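
The plan-generate-judge-revise pattern behind several of the agentic applications above can be sketched as a small control loop. `toy_generate` and `toy_judge` are hypothetical stand-ins for an LLM planner and a How2Score-style critical-failure judge (not the paper's implementation); the bounded round count is the guard against infinite revision loops mentioned above.

```python
from typing import Callable, List, Tuple

Plan = List[str]
Failures = List[str]

def refine_plan(
    generate: Callable[[str, Failures], Plan],
    judge: Callable[[str, Plan], Failures],
    goal: str,
    max_rounds: int = 3,
) -> Tuple[Plan, Failures]:
    """Regenerate the plan while the judge reports critical failures,
    stopping after max_rounds to avoid an infinite revision loop."""
    plan: Plan = []
    failures: Failures = []
    for _ in range(max_rounds):
        plan = generate(goal, failures)
        failures = judge(goal, plan)
        if not failures:
            break
    return plan, failures

# Toy stand-ins (hypothetical; real versions would call an LLM planner/judge).
def toy_generate(goal: str, failures: Failures) -> Plan:
    plan = ["boil water", "add pasta"]
    if failures:  # naive repair: patch the reported omission
        plan.append("drain pasta")
    return plan

def toy_judge(goal: str, plan: Plan) -> Failures:
    return [] if "drain pasta" in plan else ["critical omission: drain pasta"]

plan, failures = refine_plan(toy_generate, toy_judge, "cook pasta")
```

If the judge still reports failures after `max_rounds`, a real agent would escalate to a human or a tool-execution check rather than act on the plan.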

Notes on feasibility and dependencies across applications:

  • The judge is a proxy: 80.5% agreement with human majority implies residual error; safety-critical use demands expert verification or execution-based checks.
  • Training and evaluation rely on mined web data; quality, licensing, and privacy filtering are prerequisites.
  • Reported gains require careful reward design to avoid verbosity/format gaming; length control and external judge checks are advisable.
  • Topic coverage matters for generalization; single-topic training transfers but is weaker than broad topic coverage.
  • Current datasets exclude UI-interaction procedures; RPA/GUI automation requires extending the mining and judging criteria.
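
The reliability numbers cited above (Krippendorff's α = 0.593 among annotators, 80.5% judge-human agreement) can be reproduced for any binary annotation table. Below is a minimal sketch of Krippendorff's α for nominal labels with no missing ratings, built from the standard coincidence-matrix formulation; it is an illustration, not the paper's evaluation code.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list of labels per item, each rated by >= 2 annotators.
    """
    o = Counter()  # coincidence matrix over ordered label pairs
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # items with a single rating carry no pairable information
        for i, j in permutations(range(m), 2):
            o[(ratings[i], ratings[j])] += 1.0 / (m - 1)
    n = sum(o.values())
    n_c = Counter()  # marginal label frequencies
    for (c, _k), v in o.items():
        n_c[c] += v
    d_o = sum(v for (c, k), v in o.items() if c != k)       # observed disagreement
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    if d_e == 0:
        return 1.0  # only one label ever used: no expected disagreement
    return 1.0 - d_o / d_e
```

For example, three items rated by two annotators each, with one disagreement, give α = 1 - 2/3.6 ≈ 0.444.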

Glossary

  • BLEU: A string-overlap metric in NLP that measures n-gram overlap between a candidate and reference text. "string-overlap metrics like BLEU are fast to compute but inaccurate,"
  • binary score aggregation: Reducing detailed judgments to a binary success/failure label to improve reliability and comparability. "Binary score aggregation. From the judge output list, we derive a binary label:"
  • constraint satisfaction: Evaluating whether generated outputs meet a set of stated constraints. "constraint satisfaction (Yuan et al., 2023)"
  • critical failure: Any omission, contradiction, extraneous action, or severe vagueness that prevents achieving the goal or makes instructions unusable. "an LLM-based evaluation protocol that asks whether a generated procedure contains any critical failure that prevents achieving its goal."
  • distillation: Training a smaller model to mimic a larger “teacher” model’s judgments or behavior. "we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators."
  • edge prediction: Predicting relationships (edges) between steps in a procedural graph. "edge prediction over step pairs (Sakaguchi et al., 2021)"
  • embedding analysis: Using vector representations to analyze data structure, diversity, or clustering. "Via embedding analysis, we select two topics with contrasting diversity"
  • execution-based verification: Validating correctness by executing actions in an environment rather than comparing to text. "serve as reference anchors when execution-based verification is infeasible."
  • FLOPs: A measure of computational cost indicating the number of floating-point operations used during training. "about 5% into pretraining for a 1B model (around 10^21 training FLOPs)"
  • formal transition systems: Mathematical models of state transitions used to evaluate executability of procedures. "such as formal transition systems (Samiei et al., 2025)"
  • frontier model: A state-of-the-art, cutting-edge large model, typically with strong performance and high cost. "we distill a frontier model into an open 8B model"
  • goal-conditioned: Conditioned on a specific goal to generate or evaluate the sequence of actions. "goal-conditioned procedure generation"
  • grounded environments: Environments with explicit states and actions, enabling executable evaluation. "grounded environments with explicit state transitions"
  • Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm that optimizes policies using group-relative rewards. "using Group Relative Policy Optimization (GRPO) (Shao et al., 2024)"
  • inter-annotator agreement: The degree of consistency among different human annotators. "Inter-annotator agreement. With binary score aggregation, we observe Krippendorff's α = 0.593."
  • Krippendorff's alpha: A statistical measure of inter-annotator reliability across multiple raters and categories. "Krippendorff's α = 0.593."
  • leave-one-out agreement: Agreement measured by comparing each annotator to the majority formed by all others. "we measure leave-one-out agreement among human annotators"
  • length gaming: Inflating output length to exploit evaluator biases and artificially boost scores. "consistent with length gaming"
  • LLM-as-a-judge: Using an LLM to automatically evaluate and score generated outputs. "LLM-as-a-judge protocols have been used as a general approach to scale evaluation for open-ended generation"
  • n-gram overlap: Repetition or similarity measured across contiguous sequences of n tokens. "high n-gram overlap within the extracted steps."
  • open-world setting: An evaluation context with diverse, unconstrained tasks and no fixed executor or oracle. "Evaluating and optimizing end-to-end procedural validity is challenging in open-world settings"
  • oracle: An idealized, authoritative source of truth or automatic correctness signal. "without a task-specific executor or automatic oracle."
  • out-of-domain (OOD): Data or evaluations that differ from the training domain or task distribution. "without systematic regressions on standard out-of-domain benchmarks"
  • perplexity: A likelihood-based metric indicating how well a model predicts reference text; lower is better. "perplexity on reference procedure"
  • process reward models (PRMs): Verifiers that assess intermediate steps in reasoning to provide step-level rewards. "process reward models (PRMs) for mathematical reasoning"
  • reinforcement learning (RL): Learning by optimizing model behavior with respect to a reward signal. "RL using How2Score as a reward improves performance on How2Bench"
  • self-preference bias: A judge’s tendency to favor outputs from models similar to itself. "No evidence of LLM judge self-preference bias."
  • Spearman correlation: A rank-based correlation coefficient assessing monotonic relationships between ordered variables. "the Spearman correlation between checkpoint rank by How2Everything and rank by perplexity"
  • stratified sampling: Sampling that preserves proportions across predefined strata such as topics. "perform stratified sampling across 14 topics."
  • string-overlap metrics: Metrics that gauge similarity by comparing surface text overlap rather than meaning or validity. "string-overlap metrics like BLEU"
  • verbosity bias: Evaluator tendency to prefer longer outputs, potentially inflating scores. "judges are prone to verbosity bias"
  • WebOrganizer: A classifier used to label web pages by format and topic for data sampling. "as identified by WebOrganizer (Wettig et al., 2025)"
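
The "binary score aggregation" and "critical failure" entries above combine straightforwardly: any critical failure in the judge's per-issue output collapses the whole procedure to a failure. The judge-output schema below (a list of dicts with a `critical` flag) is a hypothetical illustration, not the paper's exact format.

```python
def binary_label(judge_output: list) -> int:
    """Collapse a judge's per-issue list to success (1) / failure (0):
    a single critical failure makes the whole procedure a failure."""
    return 0 if any(issue.get("critical", False) for issue in judge_output) else 1

# A procedure with only non-critical issues still counts as a success.
label = binary_label([{"issue": "slightly vague step", "critical": False}])
```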

Open Problems

We found no open problems mentioned in this paper.
