Mind2Web-live Benchmark

Updated 11 December 2025
  • Mind2Web-live is an online benchmark for web agents, featuring live demonstrations and key-node states to enable dynamic evaluation in evolving web environments.
  • It employs multi-phase filtering, re-annotation, and replay mechanisms to ensure that 542 diverse web tasks yield accurate, reproducible evaluation data.
  • Comprehensive metrics like Completion Rate and Task Success Rate reveal the current challenges in web automation, guiding improvements for robust agent performance.

Mind2Web-live is a real-world, online evaluation benchmark for web agents, developed as part of the WebCanvas framework to enable robust assessment of agents in dynamic, continuously evolving web environments. Derived through rigorous re-annotation and curation from the original Mind2Web static dataset, Mind2Web-live supplies both detailed demonstrations and precise intermediate evaluation states—termed “key nodes”—to facilitate nuanced, progress-aware measurement of agent competence. The resource serves as an open-access testbed for both concrete and abstracted functionality navigation, playing a central role in recent research on functionality-guided web automation and benchmarking (Pan et al., 18 Jun 2024, Shahbandeh et al., 16 Sep 2024).

1. Dataset Construction and Curation

Mind2Web-live originates from the Mind2Web static dataset, which contains 2,350 web navigation tasks captured as offline HTML snapshots. To ensure ongoing relevance for live web agents, an intensive multi-phase filtering and annotation pipeline is employed:

  • Source Selection and Filtering: The static dataset is pruned to exclude tasks referencing absolute dates, times, or other temporally sensitive content, as such instructions quickly become unexecutable on the evolving web.
  • Sampling and Re-annotation: 601 tasks from the training split and all 179 “cross-task” test examples are sampled. Each is replayed in a live browser using the iMean Builder plugin, which records all actions (clicks, form inputs) along with selector paths, texts, and screenshots in real time.
  • Pruning: Expired workflows no longer supported by live sites (96 cases) and ambiguously specified or unannotatable instructions (142 cases) are removed. Instructions for 51 additional tasks are revised to eliminate ambiguity and ensure reproducibility.
  • Final Curation: The resulting Mind2Web-live dataset contains 542 tasks (438 train, 104 test). Across these, 2,439 key-node states are identified (average 4.5 per task), and annotators record 4,550 low-level actions (≈8.4 steps per task).

2. Coverage and Dataset Structure

Mind2Web-live offers domain and task breadth essential for rigorous evaluation:

  • Domain Distribution: The 542 tasks span a broad range of real-world websites, covering shopping (e.g., Amazon, Kohl’s), entertainment (IMDb, Rotten Tomatoes), travel and transportation (Yelp, airline booking), finance, health, education, and more.
  • Task Variability: Instructions include goals such as “find and filter products,” “locate user reviews,” “track shipments,” and “complete forms,” reflecting diverse user intents encountered in practical web use.
  • Key Nodes and Intermediate States: Each task comprises a set of indispensable milestones—key nodes—such that a valid solution must traverse all of them, independent of the exact action sequence. For example, key nodes may require reaching a filtered results page or successfully populating a form field.
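
For illustration, a single task record pairs an instruction with several key nodes of different target types. The record below is hypothetical; the field values are invented, but the shape follows the key-node schema listed under "Dataset Access" in Section 6.

    # Hypothetical task record; values are illustrative, not taken from the dataset.
    example_task = {
        "instruction": "Find 4-star hotels in Boston and sort the results by price.",
        "key_nodes": [
            {   # milestone 1: reaching the filtered results page, matched on the URL
                "eval_type": "url",
                "match_type": "IncludeMatch",
                "url_pattern": "hotels/boston",
            },
            {   # milestone 2: a populated search field, matched on the element's value
                "eval_type": "element_value",
                "match_type": "ExactMatch",
                "value": "Boston",
            },
        ],
    }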

A commonly cited subset, as used in comparative studies such as NaviQAte, consists of 104 distinct navigation tasks systematically categorized across entertainment, shopping, and travel, with subdomain breakdowns (e.g., airlines, hotels, auto, department) precisely documented (Shahbandeh et al., 16 Sep 2024).

3. Evaluation Protocols and Metrics

Mind2Web-live introduces multi-level, robust evaluation metrics to overcome the deficiencies of static, action-prediction–based benchmarks:

  • Step-Level Evaluation (Completion Rate, CR): For each task $i$ with $K_i$ key nodes, an agent receives a point for each correctly achieved key node, judged against one of three target types (URL, element path, or element value) with a corresponding matching function (ExactMatch, IncludeMatch, or SemanticMatch; sketched after this list). The completion rate is $CR_i = \frac{1}{K_i} \sum_{k=1}^{K_i} s_{i,k}$ with $s_{i,k} \in \{0,1\}$.
  • Task-Level Evaluation (Task Success, Efficiency): A task is successfully completed only if all of its key nodes are achieved; the Task Success Rate (SR) is the average of this indicator across tasks. The Efficiency Score (ES) for task $i$ is $ES_i = L_i / \mathrm{StepScore}_i$, where $L_i$ denotes the total number of agent actions; lower efficiency scores indicate less redundancy.
  • Comparison and Robustness: The protocol is progress-aware (allowing partial credit) and path-agnostic (permitting alternate valid action sequences). This methodology confers resilience to UI variability and obviates over-penalization associated with static action prediction.
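
A minimal sketch of the three matching functions, assuming plain-string targets and an externally supplied similarity scorer for SemanticMatch; the exact matchers and threshold used by WebCanvas may differ.

    def exact_match(observed: str, target: str) -> bool:
        # The observed value (URL, element path, or element text) must equal the target.
        return observed == target

    def include_match(observed: str, target: str) -> bool:
        # The target only needs to appear somewhere in the observed value.
        return target in observed

    def semantic_match(observed: str, target: str, similarity, threshold: float = 0.8) -> bool:
        # `similarity` is an assumed callable returning a score in [0, 1], e.g. cosine
        # similarity of sentence embeddings or an LLM judge; the threshold is illustrative.
        return similarity(observed, target) >= threshold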

For NaviQAte, evaluation uses Success Rate (SR), based on human replay, and Trajectory Optimization Score (TOS), defined as the ratio of reference to agent trajectory length for successful tasks, with failures scored as zero (Shahbandeh et al., 16 Sep 2024).
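
Read literally, the per-task TOS can be computed as in the short sketch below (variable names are ours, not from the paper).

    def trajectory_optimization_score(success: bool, reference_len: int, agent_len: int) -> float:
        # Ratio of reference to agent trajectory length for a successful task;
        # failed tasks contribute a score of zero.
        if not success or agent_len == 0:
            return 0.0
        return reference_len / agent_len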

4. Annotation Interfaces and Temporal Maintenance

Task annotation and validation in Mind2Web-live are achieved via a modular toolchain:

  • iMean Builder (Browser Plugin): Records demonstrator actions in Chrome/Firefox, capturing selector paths, values, and screenshots.
  • Central Annotation Platform: Annotators designate key-node milestones, select evaluation types and match rules, and verify with screenshot evidence. All assignments undergo independent peer review.
  • Replay and Refresh: The iMean AI ReplaySDK enables automatic, headless re-execution of demonstrations to detect drift (e.g., broken selectors, CAPTCHAs, stale URLs). Results are summarized in periodic test reports, and flagged tasks undergo human re-annotation.
  • Dataset Maintenance: Scheduled revalidation (every 4–8 weeks) ensures the ongoing executability of all included workflows, limiting dataset staleness in a rapidly changing web environment.
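
A minimal sketch of such a revalidation pass, using a generic Playwright-driven headless browser rather than the actual iMean ReplaySDK, and assuming a simplified action record with only a selector and an optional input value:

    from playwright.sync_api import sync_playwright

    def replay_task(task, timeout_ms=10_000):
        """Return drift findings for one task; an empty list means the demonstration still replays."""
        findings = []
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            try:
                page.goto(task["start_url"], timeout=timeout_ms)
                for step in task["demonstration"]:        # simplified action records (assumed fields)
                    locator = page.locator(step["selector"])
                    if locator.count() == 0:
                        findings.append(f"stale selector: {step['selector']}")
                        break
                    if step.get("value") is not None:
                        locator.first.fill(step["value"])  # form input
                    else:
                        locator.first.click()              # click action
            except Exception as exc:                       # timeouts, CAPTCHAs, navigation errors, ...
                findings.append(f"replay error: {exc}")
            finally:
                browser.close()
        return findings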

5. Reported Benchmarks and Agent Analysis

The Mind2Web-live dataset underpins competitive benchmarking in recent agent evaluations:

Model                  Completion Rate   Task Success Rate   Efficiency Score (ES)
GPT-4 0125-preview     48.8%             23.1%               2.47
GPT-4 turbo 2024-04    44.3%             21.1%               2.78
Qwen1.5-110B-Chat      43.9%             20.2%               4.02
  • Domain-wise Performance: Highest completion rates are observed in entertainment (up to ~60%), with notably lower rates (30–40%) for shopping/travel tasks.
  • Environment Sensitivity: U.S.-based Windows/Chrome configurations consistently yield ~2–4% higher CR than UK/Linux alternatives.
  • Static vs. Online Discrepancy: Legacy, static benchmarks overstate reliability; moving from static (offline) to live (online) evaluation, step completion rates can drop substantially (e.g., MindAct-Large: 44.3% static → 25.5% live).
  • Comparative Systems: NaviQAte achieves a 44.23% SR and TOS of 0.58 on Mind2Web-live's 104-task subset, surpassing WebCanvas baselines by 15% (SR) and 33% (abstracted SR) (Shahbandeh et al., 16 Sep 2024).

6. Practical Usage and Integration Guidelines

  • Dataset Access: Mind2Web-live is available via HuggingFace Datasets and can be loaded as follows:
    from datasets import load_dataset
    ds = load_dataset("iMeanAI/Mind2Web-Live")
    # ds["train"], ds["test"] each contains:
    #  - "instruction": textual goal
    #  - "key_nodes": list of {url_pattern|xpath|value, eval_type, match_type}
    #  - "demonstration": list of low-level actions for reference
  • Evaluation Workflow:

def evaluate_agent(agent, env, task, match, max_steps=30):
    # `agent`, `env`, and `match` are assumed interfaces: the agent proposes the
    # next action, the environment executes it in a live browser, and `match`
    # checks one key node against the current observation.
    obs = env.reset(task["start_url"])
    history = []                     # every low-level action issued so far
    step_score = 0                   # number of key nodes achieved, in order
    for key in task["key_nodes"]:
        achieved = False
        while len(history) < max_steps:
            action = agent.plan(obs, history)
            history.append(action)
            obs, done = env.step(action)
            if match(key, obs, key["match_type"]):
                step_score += 1
                achieved = True
                break
            if done:                 # the environment ended the episode early
                break
        if not achieved:             # step budget exhausted or episode ended
            break
    finish = (step_score == len(task["key_nodes"]))
    # Efficiency Score: total actions per achieved key node (lower is better)
    efficiency = len(history) / max(step_score, 1)
    return {
        "step_score": step_score,
        "completion_rate": step_score / len(task["key_nodes"]),
        "task_success": int(finish),
        "efficiency": efficiency,
    }
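
Per-task results can then be aggregated into the benchmark-level numbers reported above; the sketch below averages each metric uniformly over tasks, which is one reasonable convention but may not match the official scorer exactly.

    def aggregate(results):
        # `results` is a list of dicts returned by evaluate_agent, one per task.
        n = len(results)
        return {
            "avg_completion_rate": sum(r["completion_rate"] for r in results) / n,
            "task_success_rate": sum(r["task_success"] for r in results) / n,
            "avg_efficiency_score": sum(r["efficiency"] for r in results) / n,
        }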

  • Best Practices: Maintain a consistent, stable browser/IP environment (preferably U.S.-hosted Windows+Chrome), employ key-node annotation for both evaluation and RL/imitation learning rewards, enforce replay-based maintenance for dataset validity, and use the iMean Builder plugin for community dataset extension.

7. Significance and Limitations

Mind2Web-live addresses the limitations of static web agent benchmarks by coupling live browser demonstrations with fine-grained, actionable evaluation targets. It enables reliable and extensible measurement of agent robustness and generalization under real-world conditions. However, its performance statistics suggest that real-world web automation remains a challenging domain, with even the best LLM-driven agents achieving moderate success rates. External factors such as CAPTCHAs, dynamic site content, and regional deployment differences can confound agent evaluation and highlight the need for continued maintenance and methodological rigor.

Mind2Web-live has become an authoritative resource for online web agent benchmarking, supporting both concrete user instruction completion and higher-level, abstracted functionality exploration (Pan et al., 18 Jun 2024, Shahbandeh et al., 16 Sep 2024).
