How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs
Abstract: Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
Explain it Like I'm 14
What is this paper about?
This paper is about teaching and testing AI chatbots on “how-to” tasks—step-by-step instructions for doing something, like cooking a recipe, fixing a gadget, or handling a legal form. The authors build a big, reusable system called “How2Everything” to collect real-world how-to procedures from the web, judge whether AI-generated steps would actually work, and then use that judgment to help AIs get better at writing instructions.
What questions are the researchers trying to answer?
Here are the main questions, explained simply:
- How can we gather lots of diverse, real-world step-by-step guides from the web to use for training and testing AI?
- How can we judge whether an AI’s step-by-step instructions are actually usable and would let someone achieve the goal?
- Do bigger and better-trained AIs write more reliable how-to instructions?
- Can we use this judging system as feedback to improve AI instruction writing without breaking other skills (like math, coding, or general knowledge)?
- Are improvements real, or are AIs just memorizing web pages or learning a “nice format” without truly better content?
How did they do it?
The authors built a framework with four main parts. Think of it like building a giant “how-to” library, a fair referee, a test set, and a training program.
1) Collecting step-by-step guides from the web (How2MINE)
They scan about 980,000 web pages that look like tutorials and extract clean, structured procedures across 14 topics (like Food & Dining, Health, Electronics, Science & Tech, and more). After filtering and cleaning, they end up with 351,000 usable procedures.
To make each example clear and testable, they include:
- The goal (what you’re trying to do)
- A list of resources (tools or ingredients mentioned in the steps)
- The reference steps (the original step-by-step instructions from the web page)
From this big pool:
- They build an evaluation set called How2Bench (7,000 examples; 500 per topic)
- They create a training set called How2Train (the remaining 300k+ examples)
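The goal/resources/steps structure above can be sketched as a simple record. The field names and the example procedure here are illustrative stand-ins, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    """One mined how-to example (illustrative fields, not the paper's schema)."""
    goal: str             # what the procedure achieves
    resources: list[str]  # tools/ingredients mentioned in the steps
    steps: list[str]      # reference step-by-step instructions from the page
    topic: str            # one of the 14 topic labels

example = Procedure(
    goal="Bake a simple loaf of bread",
    resources=["flour", "yeast", "water", "salt", "oven"],
    steps=[
        "Mix flour, yeast, salt, and water into a dough.",
        "Knead for 10 minutes, then let rise for 1 hour.",
        "Shape the loaf and let it rise another 30 minutes.",
        "Bake at 220 C for 30 minutes.",
        "Cool on a rack before slicing.",
    ],
    topic="Food & Dining",
)
print(len(example.steps))  # 5, within the paper's 5-to-15-step filter
```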
2) Judging AI-written procedures for “critical failures” (How2SCORE + How2JUDGE)
They define a “critical failure” as a mistake that would stop you from reaching the goal or make the instructions unusable. Examples:
- Skipping a required waiting period in a legal process
- Leaving out key parameters like cooking time or temperature, so you can’t actually finish the recipe
- Doing unsafe or contradictory steps
They use a strong AI judge (a frontier model) to check AI outputs for these failures and then “distill” its judging skills into a smaller, open model called How2Judge (8 billion parameters). Distilling means training a smaller model to mimic a bigger one so it’s cheaper and easier to use. How2Judge agrees with human annotators about 80.5% of the time, which is good enough for large-scale testing and training.
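In spirit, the judging step maps a (goal, steps) pair to a binary label. The prompt wording, the `ask_judge` callable, and the stub below are hypothetical sketches, not the paper's actual artifacts:

```python
# Minimal sketch of critical-failure judging. JUDGE_PROMPT and ask_judge
# are illustrative stand-ins for the real judge model and its prompt.
JUDGE_PROMPT = """You are checking a step-by-step procedure.
Goal: {goal}
Steps:
{steps}
Does any step contain a critical failure (a missing requirement,
contradiction, or unsafe action that would prevent reaching the goal)?
Answer exactly "no_failure" or "critical_failure"."""

def score_procedure(goal, steps, ask_judge):
    """Return 1 if the judge reports no critical failure, else 0."""
    prompt = JUDGE_PROMPT.format(goal=goal, steps="\n".join(steps))
    verdict = ask_judge(prompt).strip().lower()
    return 1 if verdict == "no_failure" else 0

# Toy stand-in judge: passes only recipes that specify an oven temperature.
stub = lambda p: "no_failure" if "180 C" in p else "critical_failure"
print(score_procedure(
    "Bake a cake",
    ["Mix the batter.", "Bake at 180 C for 40 minutes.", "Cool and serve."],
    stub,
))  # 1
```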
3) Testing lots of AIs (How2Bench)
They ask many different AI models to write procedures for the same goals and score them with the judge. Models get a “success rate” based on how often their instructions contain no critical failures. They control the number of steps (so every model outputs the same count) to keep comparisons fair.
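The "success rate" is just the fraction of generations with no critical failure; since How2Bench is balanced per topic, per-topic breakdowns are natural. A minimal sketch (the data and helper are illustrative):

```python
from collections import defaultdict

def success_rates_by_topic(results):
    """results: list of (topic, passed) pairs, where passed is True when the
    judge found no critical failure. Returns per-topic success rates."""
    totals, passes = defaultdict(int), defaultdict(int)
    for topic, passed in results:
        totals[topic] += 1
        passes[topic] += int(passed)
    return {t: passes[t] / totals[t] for t in totals}

demo = [("Food", True), ("Food", False), ("Health", True), ("Health", True)]
print(success_rates_by_topic(demo))  # {'Food': 0.5, 'Health': 1.0}
```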
4) Improving AIs using the judge’s feedback (Reinforcement Learning)
They use the judge’s score as a reward signal in reinforcement learning (RL), a training method where the AI tries things, gets feedback, and adjusts. They add simple format checks and length control so models don’t “game” the score by being overly long or flowery. The result: models get better at writing usable how-to procedures.
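A combined reward of this kind might look as follows. The specific format check, step-count check, and penalty weights are illustrative assumptions, not the paper's reward definition:

```python
def reward(generation, judge_pass, target_steps):
    """Sketch of a judge-based RL reward with simple guards against format
    drift and length gaming (weights and checks are illustrative)."""
    steps = [line for line in generation.splitlines() if line.strip()]
    format_ok = all(s.lstrip()[0].isdigit() for s in steps)  # numbered steps
    length_ok = len(steps) == target_steps                   # exact step count
    r = 1.0 if judge_pass else 0.0
    if not format_ok:
        r -= 0.5
    if not length_ok:
        r -= 0.5
    return max(r, 0.0)

gen = "1. Preheat the oven to 180 C.\n2. Mix the batter.\n3. Bake 40 minutes."
print(reward(gen, judge_pass=True, target_steps=3))  # 1.0
```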
What did they find, and why does it matter?
Main findings
- Scale and diversity: They mined 351,000 real-world procedures across 14 topics from 980,000 web pages. That’s a huge, diverse dataset.
- Reliable judging at low cost: Their small judge (How2Judge) matches human labels about 80.5% of the time, making it practical to use at scale.
- Clear performance trends: On How2Bench, bigger and better-trained models make fewer critical mistakes. The benchmark shows smooth “scaling” behavior and can track improvements even early in training.
- RL makes models better at how-to writing: Using the judge’s feedback, models improved their How2Bench scores by more than 10 points across three different models.
- No broad harm to other skills: After RL, models did not consistently get worse on 12 external benchmarks covering math, coding, knowledge, and reasoning. Some scores even improved.
- Improvements aren’t just formatting or memorization: Gains weren’t mainly due to learning a neat style or repeating training data. For example, even when source documents were shown multiple times during training, the how-to scores only rose modestly.
Why it matters
- How-to guidance is a common use for chatbots. Making this skill measurable—and trainable—helps build assistants that give safer, more reliable instructions.
- The judge focuses on “would this actually work?” rather than word overlap or surface niceness. That means better real-world usefulness.
- The approach creates a “closed loop”: collect web data → evaluate models reliably → use the evaluation to improve models → test again. This loop can push steady progress.
What methods or terms might feel technical?
Here are a few, explained with simple analogies:
- LLM: A very advanced text-predicting AI. Think of it as a super smart autocomplete that can write full answers.
- Benchmark: A standardized test set. Like a consistent quiz used to compare different students (models).
- Critical failure: A serious mistake that would prevent success. Imagine a recipe that never tells you to bake the cake—no matter how pretty the instructions look, you won’t get a cake.
- Distillation: Teaching a smaller AI to judge like a bigger, more expensive AI. Like a trainee learning from a master so the trainee can handle day-to-day work cheaply.
- Reinforcement Learning (RL): Training with feedback signals. Think of a coach giving a score after each try; the player (the model) adjusts to get higher scores next time.
What could this change in the real world?
- More dependable AI assistants: Better, step-by-step instructions for everyday tasks (cooking, DIY, travel), troubleshooting (electronics, home repairs), and planning (study routines, fitness).
- Safer, clearer guidance: The judge penalizes missing steps, contradictions, and unsafe actions, nudging models toward practical instructions.
- A reusable process: Other teams can use the data, the judge, and the protocol to build and improve how-to capabilities in their own models.
Final takeaway
The authors show that collecting huge numbers of real-world how-to procedures and judging them for “will this actually work?” can both test and improve AI models at scale. Their system helps AIs write better step-by-step instructions without harming other skills, and it moves us toward chatbots that are more useful and trustworthy for everyday tasks.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored, framed to be actionable for future work:
- Executable validation gap: No correlation study between How2SCORE and actual task success under execution (simulation, robotic/GUI environments, or expert enactment); design execution-grounded subsets to calibrate the judge and metric.
- Limited human validation: Only 200 examples with non-expert annotators and moderate agreement (α≈0.59); expand to larger, topic-stratified human (and domain-expert) panels with confidence/rationale capture and report CIs.
- Reference anchoring risk: References may contain errors and are treated as anchors; quantify how often valid alternative procedures are wrongly penalized and introduce multi-reference/equivalence-class judging.
- Binary metric coarseness: A single “any critical failure” label loses severity and partial-credit signal; develop severity-weighted scores, per-step localization, and PRM-style labels for procedural tasks.
- Judge generalization and bias: How2JUDGE is distilled from a single teacher and trained on 73K labels; stress-test cross-domain and cross-family biases, style/verbosity sensitivity beyond length, and robustness to adversarial generations that target judge weaknesses.
- External validation breadth: Gains were checked on a small set of external judges; broaden to judge ensembles (including open-source), mixed human+LLM meta-judging, and report stability across evaluators.
- Reward hacking beyond length: Length control is addressed, but other gaming avenues (e.g., over-hedging, generic safety disclaimers, superficial parameter stuffing) are untested; build targeted adversarial evaluations and reward guards.
- Evaluation controls vs natural use: Models are given resource lists and exact step counts, which users typically don’t provide; measure performance without the resource list (R) and step count (n), add a step-count estimation task, and evaluate robustness to omitted/incorrect resources.
- Coverage exclusions: Pipeline filters out UI-based procedures, named-entity–specific tasks, and <5 or >15 step tasks; extend mining to digital/GUI tasks, brand/jurisdiction-specific instructions, and very short/long or hierarchical procedures.
- Long-horizon and branching structures: Current evaluation is linear; introduce tasks with conditional branches, loops, and interdependencies, and judge correctness of conditionality (e.g., if/else prerequisites).
- Safety characterization: No targeted analysis of safety-critical failure modes (health/legal/industrial); create safety-graded subsets, risk-weight critical failures, and measure whether RL increases unsafe assertiveness or suppresses prudent warnings.
- Domain expertise: Human validation avoided specialized topics, yet benchmark includes them; build expert-annotated test slices (health/legal/industrial) and compare expert vs generalist judgments and judges’ calibration.
- Memorization/confounding: Memorization tests focus on midtraining; extend to pretraining, publish document URL/hash overlap stats, and perform per-example contamination audits between model corpora and How2TRAIN/BENCH.
- Topic and cultural bias: The 14-topic, English, web-mined data may be geographically/culturally skewed; audit domain/geography distributions, add multilingual procedures, and test jurisdiction-specific legal/health variability.
- Data licensing and privacy: Web-mined content raises licensing/PII questions; detail licenses, PII filtering efficacy, and provide a replicable, legally-compliant release (URLs/hashes, removal requests).
- Mining reproducibility: Mining relies on closed APIs (GPT-4.1) for multiple stages; provide open-model extractors/filters, report accuracy/cost trade-offs, and quantify delta in dataset quality vs closed-API versions.
- Judge training transparency: Release (or summarize) the 73K judge-training annotations (with decontamination proofs) and ablate teacher choices to quantify teacher bias imprinting on the distilled judge.
- OOD impact breadth: OOD assessments cover 12 benchmarks but omit alignment metrics (toxicity, helpfulness, factuality, calibration); add these and perform user studies on helpfulness and trust post-RL.
- RL stability and design: Only GRPO and one reward mix were tested; run ablations over algorithms, reward weights, seeds, and curriculum schedules; assess catastrophic forgetting and long-run stability.
- Tool-augmented agents: No evaluation of retrieval/tool use for reducing critical failures; benchmark agentic variants (retrieval, calculators, checklists) and measure judge/human-validated gains.
- Difficulty modeling: Step count correlates with difficulty, but other factors (parameter density, prerequisite chaining, domain hazard level) are unmodeled; create calibrated difficulty strata and report per-stratum performance.
- Resource extraction fidelity: Resource lists are derived from references and may be incomplete; audit extraction errors and modify judging to accept justified additions of missing but necessary resources.
- Per-topic diagnostics: Beyond aggregate scores, publish per-topic failure taxonomies and ablate topic-balanced vs real-world frequency-weighted evaluation to understand transfer and deployment relevance.
- Naturalness and usability: Evaluation enforces one concise sentence per step; measure effects of more natural, multi-sentence steps, inline rationales, and formatting variants on judge agreement and human usability.
- Alternative-validity stress tests: Construct “diversity” sets where multiple distinct procedures are valid; measure judge tolerance to procedural diversity and develop prompts/training to reduce over-anchoring to references.
- Multimodal procedures: Many how-tos are image/video-first; extend mining and evaluation to multimodal inputs/outputs and test whether multimodality reduces critical failures.
- Scaling laws and forecasting: Formalize performance–compute/token scaling fits with uncertainty bands, compare to perplexity-based predictors, and test cross-run generalization beyond OLMo trajectories.
Practical Applications
Below, the paper’s contributions (How2MINE, How2BENCH, How2SCORE, How2JUDGE, and RL with How2SCORE) are translated into practical applications with sectors, deployability, and feasibility notes.
Immediate Applications
- CI-grade benchmark for procedural quality in LLM training and model selection
- Sector: software/AI platforms and model labs
- Application: Integrate How2BENCH + How2SCORE (with the open 8B How2JUDGE) into continuous evaluation to track scaling trends, catch regressions, and gate releases; slice difficulty by step count to diagnose failure modes early in pretraining/midtraining.
- Tools/workflows: Add a “procedural validity” job to eval harnesses; nightly dashboards; per-topic scorecards across 14 topics.
- Assumptions/dependencies: Access to How2BENCH and How2JUDGE; acceptance that 80.5% human agreement is sufficient for gating; enforce length/format controls to avoid verbosity gaming.
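A regression gate of this kind could be sketched as below; the thresholds, topic names, and function are hypothetical, not part of the released framework:

```python
def gate_release(success_rates, threshold=0.7, max_drop=0.02, baseline=None):
    """Illustrative CI gate: collect failures when any topic's success rate
    falls below a floor or regresses past max_drop vs the prior checkpoint."""
    failures = []
    for topic, rate in success_rates.items():
        if rate < threshold:
            failures.append(f"{topic}: {rate:.2f} below floor {threshold}")
        if baseline and rate < baseline.get(topic, 0.0) - max_drop:
            failures.append(f"{topic}: regressed from {baseline[topic]:.2f}")
    return failures

prev = {"Food & Dining": 0.81, "Health": 0.74}
curr = {"Food & Dining": 0.82, "Health": 0.69}
print(gate_release(curr, baseline=prev))  # flags Health twice
```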
- RL-based improvement of how-to generation without broad OOD regressions
- Sector: software/AI training
- Application: Use How2SCORE (scored by How2JUDGE) as a reward in GRPO/PPO-style RL to boost end-to-end procedural validity (+10 points reported) while maintaining standard benchmark performance.
- Tools/workflows: Extend post-training pipelines (RLHF/RLAIF) with an additional How2SCORE reward head; include a length-control reward and a format verifier to prevent gaming.
- Assumptions/dependencies: Compute budget for RL; careful reward design (length control, format checks); safety review of reward-induced behaviors.
- Low-cost LLM judge for production guardrails on how-to responses
- Sector: consumer assistants, enterprise copilots, helpdesk chat
- Application: Run-time screening of generated procedures for “critical failures” before display; auto-regenerate until no_failure or escalate to human.
- Tools/workflows: On-call guardrail microservice using How2JUDGE; retry-and-revise loops with targeted feedback from detected failure types (omission, unsafe step, contradiction).
- Assumptions/dependencies: Latency/SLA budgets for a second-pass judge; judge bias and 80.5% agreement imply residual risk—should be paired with safety policies.
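The retry-and-escalate loop above can be sketched as follows; `generate` and `judge_passes` are caller-supplied callables standing in for the generator model and the judge, not the paper's API:

```python
def guarded_generate(prompt, generate, judge_passes, max_retries=2):
    """Sketch of a run-time guardrail: regenerate until the judge finds no
    critical failure, otherwise signal escalation to a human."""
    for attempt in range(max_retries + 1):
        answer = generate(prompt, attempt)
        if judge_passes(answer):
            return answer, attempt
    return None, max_retries + 1  # escalate: no passing answer produced

# Stub backends: the second attempt "fixes" the missing bake step.
gen = lambda prompt, i: "Mix. Bake. Cool." if i > 0 else "Mix. Cool."
ok = lambda answer: "Bake" in answer
answer, attempts = guarded_generate("How do I bake bread?", gen, ok)
print(answer, attempts)  # Mix. Bake. Cool. 1
```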
- SOP and runbook quality checking (“procedure linting”)
- Sector: DevOps/IT, manufacturing, healthcare operations, customer support
- Application: Batch-scan standard operating procedures, knowledge-base articles, and runbooks to flag critical omissions or contradictions; improve handoffs and on-call playbooks.
- Tools/workflows: CMS/Confluence/GitHub Actions plugin (“How2 Lint”); pull-request checks that annotate suspect steps.
- Assumptions/dependencies: References are used as anchors; not a substitute for domain-expert validation in safety-critical contexts.
- Content moderation and safety gate for user-posted how-to material
- Sector: social/content platforms, marketplaces
- Application: Pre-publication screening for unsafe or fatally flawed instructions; triage to human moderators.
- Tools/workflows: Queue-based moderation with topic-aware thresholds; per-topic safety policies (e.g., health, legal).
- Assumptions/dependencies: Additional hazard classifiers are recommended; the judge targets “critical failure,” not comprehensive safety.
- Vendor benchmarking and procurement due diligence for AI buyers
- Sector: public sector, regulated industries, enterprise AI procurement
- Application: Compare closed/open models with a reproducible, low-cost procedural validity score; request difficulty-controlled breakdowns (by step count, topic).
- Tools/workflows: RFPs include How2BENCH scores and per-topic reports; third-party audit scripts using How2JUDGE.
- Assumptions/dependencies: Standardized prompts and length controls; public documentation of limitations to avoid over-claiming “real-world correctness.”
- Educational content authoring and review
- Sector: education, workforce training
- Application: Generate and validate stepwise lesson plans, lab protocols, and practice tasks; ensure key parameters (times, quantities) are present.
- Tools/workflows: LMS plugins to draft procedures and run a “critical failure check” before publishing to students.
- Assumptions/dependencies: Domain calibration; non-executable judging means expert review is still needed for labs and high-risk activities.
- Troubleshooting assistants with self-check-and-fix loops
- Sector: IT support, electronics, smart-home support
- Application: LLM generates a fix workflow; How2JUDGE flags missing preconditions or unsafe steps; the agent revises before presenting to the user.
- Tools/workflows: Ticketing-system integration; “revise-on-failure” agent policy with topic-specific hints.
- Assumptions/dependencies: UI-interaction steps were filtered out in the dataset—mapping to real UI actions requires extension; careful logging for accountability.
- Data curation and augmentation for SFT/midtraining
- Sector: AI training
- Application: Use How2TRAIN to augment instruction tuning for planning-oriented skills; select topics where current models underperform.
- Tools/workflows: Curriculum that increases step count and topic diversity over time; contamination checks for overlap with eval splits.
- Assumptions/dependencies: Licensing and privacy compliance for mined web data; modest memorization risk per paper (+3 points observed under aggressive contamination).
- Research instrumentation for scaling-laws, PRMs, and evaluation science
- Sector: academia/AI evaluation
- Application: Study emergence and decoupling of formatting vs validity; compare perplexity vs task-level validity; develop length-robust evaluators.
- Tools/workflows: Open How2BENCH split; How2JUDGE as an affordable judge; cross-judge robustness analyses.
- Assumptions/dependencies: Acceptance of binary aggregation (“any critical failure”) as the primary endpoint; replication across languages and domains pending.
Long-Term Applications
- Execution-grounded procedure verification and robotics planning
- Sector: robotics, smart manufacturing, household robotics
- Application: Bridge descriptive procedures to executable plans via simulators or symbolic planners; use How2SCORE-like PRMs to detect catastrophic plan flaws pre-execution.
- Tools/workflows: Text-to-plan translation; simulator-in-the-loop verification; PRMs aligned to execution traces.
- Assumptions/dependencies: Reliable grounding and state estimation; domain simulators; safety cases for physical actuation.
- Enterprise process automation and RPA co-pilots
- Sector: finance operations, HR, IT, supply chain
- Application: Convert validated stepwise procedures into RPA scripts; judge detects missing approvals or compliance steps before deployment.
- Tools/workflows: Procedure-to-RPA compilers; compliance-aware judges; change-management pipelines.
- Assumptions/dependencies: Current dataset excludes UI/GUI steps; requires extending mining to UI flows and integrating with enterprise systems; strong governance.
- Sector-specialized judges for high-stakes domains
- Sector: healthcare, legal, aviation, chemical safety
- Application: Distill domain-specific How2JUDGE variants using expert-labeled data; incorporate stronger hazard taxonomies beyond “critical failure.”
- Tools/workflows: Expert-in-the-loop distillation; test suites with adverse-event cases; hierarchical safety gating.
- Assumptions/dependencies: Access to expert annotations; regulatory alignment; rigorous validation (the base judge is ~80.5% human agreement, insufficient alone for safety-critical use).
- Standards and certification for AI procedural reliability
- Sector: policy/regulation, industry consortia
- Application: Establish certification tiers for procedural assistants (topic coverage, difficulty bands, failure rates); require reporting of judge calibration and cross-judge robustness.
- Tools/workflows: Public benchmarks with versioning; auditing protocols; procurement templates referencing certified scores.
- Assumptions/dependencies: Multi-stakeholder consensus on metrics; governance for updates and deprecation; mitigation of judge bias/self-preference.
- Continual web-scale mining for living knowledge bases
- Sector: knowledge platforms, documentation tooling
- Application: Periodically refresh topic-balanced how-to corpora; detect drift and outdated steps; auto-suggest updates to manuals.
- Tools/workflows: Scheduled How2MINE jobs; deduplication and link rot handling; change-impact analysis.
- Assumptions/dependencies: Sustainable API/computation costs; robust safety/privacy filtering; publisher permissions.
- Multilingual and low-resource expansion
- Sector: global education, international support centers
- Application: Mine and judge how-to procedures across languages; train multilingual judges; close performance gaps in non-English settings.
- Tools/workflows: Cross-lingual topic classifiers; multilingual distillation of judges; alignment with local regulations/cultural norms.
- Assumptions/dependencies: Availability of diverse web corpora; language-specific safety and legal considerations.
- Authoring IDEs with real-time procedural linting and repair
- Sector: technical writing, medical device documentation, aviation MRO
- Application: As authors type procedures, the system flags likely critical failures and proposes patches; enforces step counts and parameter specificity.
- Tools/workflows: Editor plugins; structured authoring schemas; change justification logs.
- Assumptions/dependencies: Domain-specific templates and controlled vocabularies; human acceptance and traceability requirements.
- Autonomous planning agents with self-verification
- Sector: AI agents for research, operations, and daily life
- Application: Agents generate multi-step plans, self-judge for critical failures, and iteratively refine before acting or delegating to tools/humans.
- Tools/workflows: Plan-generate–judge–revise loops; hybrid PRMs combining textual and tool-execution signals.
- Assumptions/dependencies: Robust integration with tool APIs; calibrated confidence thresholds to avoid infinite revision loops.
- Emergency and disaster-response checklist generation
- Sector: public safety, humanitarian logistics
- Application: Rapidly compose context-specific checklists (evacuation, field triage) and auto-flag critical omissions; adapt across locales and constraints.
- Tools/workflows: Incident-specific templates; human-in-the-loop validation workflows; multilingual dissemination.
- Assumptions/dependencies: Formal endorsement by authorities; rigorous red teaming; offline capability and reliability.
- Patient education and discharge instruction QC at scale
- Sector: healthcare providers, payers
- Application: Screen patient-facing instructions for missing parameters (doses, timings), contradictions, or unsafe steps; prioritize for clinician review.
- Tools/workflows: EHR-integrated pipelines; clinician sign-off dashboards; audit trails.
- Assumptions/dependencies: Regulatory compliance (HIPAA, GDPR); domain-specialized judges; clinician oversight; recognition that non-executable judging is an aid, not a guarantee.
Notes on feasibility and dependencies across applications:
- The judge is a proxy: 80.5% agreement with human majority implies residual error; safety-critical use demands expert verification or execution-based checks.
- Training and evaluation rely on mined web data; quality, licensing, and privacy filtering are prerequisites.
- Reported gains require careful reward design to avoid verbosity/format gaming; length control and external judge checks are advisable.
- Topic coverage matters for generalization; single-topic training transfers but is weaker than broad topic coverage.
- Current datasets exclude UI-interaction procedures; RPA/GUI automation requires extending the mining and judging criteria.
Glossary
- BLEU: A string-overlap metric in NLP that measures n-gram overlap between a candidate and reference text. "string-overlap metrics like BLEU are fast to compute but inaccurate,"
- binary score aggregation: Reducing detailed judgments to a binary success/failure label to improve reliability and comparability. "Binary score aggregation. From the judge output list, we derive a binary label:"
- constraint satisfaction: Evaluating whether generated outputs meet a set of stated constraints. "constraint satisfaction (Yuan et al., 2023)"
- critical failure: Any omission, contradiction, extraneous action, or severe vagueness that prevents achieving the goal or makes instructions unusable. "an LLM-based evaluation protocol that asks whether a generated procedure contains any critical failure that prevents achieving its goal."
- distillation: Training a smaller model to mimic a larger “teacher” model’s judgments or behavior. "we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators."
- edge prediction: Predicting relationships (edges) between steps in a procedural graph. "edge prediction over step pairs (Sakaguchi et al., 2021)"
- embedding analysis: Using vector representations to analyze data structure, diversity, or clustering. "Via embedding analysis, we select two topics with contrasting diversity"
- execution-based verification: Validating correctness by executing actions in an environment rather than comparing to text. "serve as reference anchors when execution-based verification is infeasible."
- FLOPs: A measure of computational cost indicating the number of floating-point operations used during training. "about 5% into pretraining for a 1B model (around 10²¹ training FLOPs)"
- formal transition systems: Mathematical models of state transitions used to evaluate executability of procedures. "such as formal transition systems (Samiei et al., 2025)"
- frontier model: A state-of-the-art, cutting-edge large model, typically with strong performance and high cost. "we distill a frontier model into an open 8B model"
- goal-conditioned: Conditioned on a specific goal to generate or evaluate the sequence of actions. "goal-conditioned procedure generation"
- grounded environments: Environments with explicit states and actions, enabling executable evaluation. "grounded environments with explicit state transitions"
- Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm that optimizes policies using group-relative rewards. "using Group Relative Policy Optimization (GRPO) (Shao et al., 2024)"
- inter-annotator agreement: The degree of consistency among different human annotators. "Inter-annotator agreement. With binary score aggregation, we observe Krippendorff's α = 0.593."
- Krippendorff's alpha: A statistical measure of inter-annotator reliability across multiple raters and categories. "Krippendorff's α = 0.593."
- leave-one-out agreement: Agreement measured by comparing each annotator to the majority formed by all others. "we measure leave-one-out agreement among human annotators"
- length gaming: Inflating output length to exploit evaluator biases and artificially boost scores. "consistent with length gaming"
- LLM-as-a-judge: Using an LLM to automatically evaluate and score generated outputs. "LLM-as-a-judge protocols have been used as a general approach to scale evaluation for open-ended generation"
- n-gram overlap: Repetition or similarity measured across contiguous sequences of n tokens. "high n-gram overlap within the extracted steps."
- open-world setting: An evaluation context with diverse, unconstrained tasks and no fixed executor or oracle. "Evaluating and optimizing end-to-end procedural validity is challenging in open-world settings"
- oracle: An idealized, authoritative source of truth or automatic correctness signal. "without a task-specific executor or automatic oracle."
- out-of-domain (OOD): Data or evaluations that differ from the training domain or task distribution. "without systematic regressions on standard out-of-domain benchmarks"
- perplexity: A likelihood-based metric indicating how well a model predicts reference text; lower is better. "perplexity on reference procedure"
- process reward models (PRMs): Verifiers that assess intermediate steps in reasoning to provide step-level rewards. "process reward models (PRMs) for mathematical reasoning"
- reinforcement learning (RL): Learning by optimizing model behavior with respect to a reward signal. "RL using How2SCORE as a reward improves performance on How2BENCH"
- self-preference bias: A judge’s tendency to favor outputs from models similar to itself. "No evidence of LLM judge self-preference bias."
- Spearman correlation: A rank-based correlation coefficient assessing monotonic relationships between ordered variables. "the Spearman correlation between checkpoint rank by How2EVERYTHING and rank by perplexity"
- stratified sampling: Sampling that preserves proportions across predefined strata such as topics. "perform stratified sampling across 14 topics."
- string-overlap metrics: Metrics that gauge similarity by comparing surface text overlap rather than meaning or validity. "string-overlap metrics like BLEU"
- verbosity bias: Evaluator tendency to prefer longer outputs, potentially inflating scores. "judges are prone to verbosity bias"
- WebOrganizer: A classifier used to label web pages by format and topic for data sampling. "as identified by WebOrganizer (Wettig et al., 2025)"