MacroBench: LLM Web Automation Benchmark

Updated 12 October 2025
  • MacroBench is a benchmarking suite that evaluates the ability of LLMs to generate reusable browser automation scripts from natural language instructions.
  • It rigorously tests model performance on interpreting HTML/DOM, executing multi-step workflows, and ensuring code robustness in simulated web environments.
  • Results show high success rates on simple and medium tasks but complete failure on complex workflows that require planning and error recovery, underscoring the gap to production-ready automation.

MacroBench is a code-first benchmarking suite and evaluation protocol that systematically assesses the capabilities of LLMs to synthesize reliable, reusable browser automation scripts from natural language instructions. The benchmark targets the core automation workflow in modern web environments: reading and parsing HTML/DOM, interpreting task-level goals, and emitting fully functional Python code interoperable with Selenium. MacroBench focuses on the end-to-end process—moving from intent specification to verifiable code execution—across a diversity of web platform archetypes and interaction complexities, while also incorporating explicit validation for robustness and safety against misuse.
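As a concrete illustration of the artifact MacroBench scores, a short task such as posting a comment on a forum-style synthetic site would be expected to yield a self-contained Python+Selenium macro along the following lines. The sketch is illustrative only: the URL, selectors, and credentials are hypothetical placeholders rather than the benchmark's actual fixtures.

```python
# Hypothetical macro of the kind MacroBench asks models to emit:
# log in with a seeded account and post a comment on the first thread.
# URLs, selectors, and credentials are illustrative placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def post_comment(base_url: str, username: str, password: str, text: str) -> None:
    driver = webdriver.Chrome()
    try:
        wait = WebDriverWait(driver, 10)

        # Log in with the seeded test account.
        driver.get(f"{base_url}/login")
        wait.until(EC.presence_of_element_located((By.NAME, "username"))).send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

        # Open the first thread on the front page and submit a comment.
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.thread-title"))).click()
        comment_box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "textarea.comment")))
        comment_box.send_keys(text)
        driver.find_element(By.CSS_SELECTOR, "button.comment-submit").click()

        # Wait until the comment is rendered, so success is observable in the DOM.
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "div.comments"), text))
    finally:
        driver.quit()

if __name__ == "__main__":
    post_comment("http://localhost:8000", "seed_user", "seed_pass", "Nice post!")
```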

1. Benchmark Scope and Design Objectives

MacroBench is oriented toward evaluating three essential competencies in LLM-driven automation programming: (a) correct interpretation of HTML/DOM structure, (b) functional code generation yielding executable browser macros, and (c) multistep planning to transform abstract task descriptions into concrete, strategy-aligned navigation and interaction flows. The benchmark instantiates seven self-hosted synthetic sites, each modeled after a high-traffic web application (Airbnb-like, TikTok-like, Reddit-like, Instagram-like, Facebook-like, Discord-like, Threads-like). Each environment is initialized in a deterministic state with seeded databases and known user accounts. Tasks—681 in total—are derived from real-world user stories and organized along an explicit interaction-complexity spectrum: simple (single-step), medium (multi-step workflows), and complex (conditional logic, error recovery, cross-page workflows).
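To make the complexity tiers concrete, hypothetical task phrasings for a Reddit-like site (not drawn from the 681-task set) might look as follows:

```python
# Illustrative examples of the three complexity tiers; wording is hypothetical.
EXAMPLE_TASKS = {
    "simple":  "Upvote the first post on the front page.",             # single step
    "medium":  "Log in, open the newest thread, and post a comment.",  # multi-step workflow
    "complex": "Find a thread matching a keyword across several pages; "
               "if none exists, create one, then reply and verify the reply appears.",
               # conditional logic, error recovery, cross-page navigation
}
```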

By design, MacroBench avoids pixel-based evaluation or screen scraping, emphasizing instead the generation of automation code that is maintainable, auditable, and suitable for persistent reuse by developers and QA engineers.

2. Methodology and Evaluation Protocol

MacroBench executes a tightly controlled evaluation pipeline entailing:

  • Synthetic Website Provisioning: Target sites are provisioned in isolated containers, ensuring repeatable and deterministic webpage states. Each synthetic site reflects the data and UI structures characteristic of its archetype (e.g., infinite scroll for TikTok-like, threaded comments for Reddit-like).
  • Task Definition and Prompt Construction: Each task is defined via a natural language instruction set, alongside clear success criteria and static/dynamic validation rules. Prompt templates include (i) task specification, (ii) relevant DOM or HTML snippets, (iii) requirements for Python+Selenium output, and (iv) few-shot demonstration examples.
  • Code Generation and Feedback Loop: LLMs receive the prompts and are allotted up to two attempts per task. Execution traces and outcome feedback are provided between attempts to enable possible self-correction.
  • Validation Pipeline: Submitted scripts undergo:

    1. Static analysis (linting, required imports, code safety checks),
    2. Sandboxed headless execution within browser containers,
    3. Outcome validation via DOM assertions and database state comparison,
    4. Error attribution (syntax, runtime, timing, or logical/plan errors).
  • Safety Screening: Tasks involving scraping, spam, or credential/privacy-sensitive prompts are used to evaluate the model's adherence to safety policy, tracking both outright refusals and the quality of 'refuse-and-repair' code.

The benchmark pipeline is engineered for reproducibility, leveraging deterministic environment provisioning and comprehensive execution telemetry.
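A minimal sketch of how the four validation stages above can be orchestrated is given below; the container image, helper names, and safety checks are stand-ins rather than MacroBench's actual harness.

```python
# Hypothetical orchestration of the four-stage validation pipeline.
# Stage names mirror the protocol above; image and helper names are placeholders.
import ast
import subprocess

FORBIDDEN_NAMES = {"eval", "exec"}  # illustrative subset of the safety checks

def static_check(source: str) -> list:
    """Stage 1: lint-style static analysis of a submitted macro."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_NAMES:
                problems.append(f"forbidden call: {node.func.id}")
    return problems

def run_sandboxed(script_path: str, timeout: int = 120) -> subprocess.CompletedProcess:
    """Stage 2: execute the macro inside an isolated headless-browser container."""
    return subprocess.run(
        ["docker", "run", "--rm", "macrobench/headless-runner", "python", script_path],
        capture_output=True, text=True, timeout=timeout,
    )

def validate_outcome(dom_ok: bool, db_ok: bool) -> bool:
    """Stage 3: both DOM assertions and database-state checks must pass."""
    return dom_ok and db_ok

def attribute_error(proc: subprocess.CompletedProcess, outcome_ok: bool) -> str:
    """Stage 4: coarse error attribution from the execution trace and outcome check."""
    if "SyntaxError" in proc.stderr:
        return "syntax"
    if "TimeoutException" in proc.stderr:
        return "timing"
    if proc.returncode != 0:
        return "runtime"
    return "success" if outcome_ok else "logical/plan"
```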

3. Performance Metrics, Error Analysis, and Code Quality

MacroBench reports result stratifications across models and tasks:

  • Aggregate Model Performance: Across 2,636 model–task runs:
    • GPT-4o-Mini: 96.8% overall success
    • GPT-4.1: 95.3%
    • Gemini-2.5-Pro: 89.0%
    • DeepSeek-V3.1: 83.4%
  • Task Complexity Decomposition:
    • Simple tasks: 91.7% success across models
    • Medium tasks: 84.1%
    • Complex tasks: 0.0% (no model succeeded)
  • Platform-Specific Variation: Discord-like and Facebook-like sites yield the highest model success (99.5% and 98.7% respectively), while TikTok-like (infinite feed) sites pose greater difficulty (81.5%).
  • Error Typology: While LLMs generate syntactically valid and functionally passing code for most simple and medium tasks, the resulting macro scripts consistently fail to meet production-quality standards, a shortfall characterized as an "engineering gap." Recurring deficiencies include lack of explicit waits, fragile selectors, absence of error handling, lack of parameterization, and brittle sequencing, all of which impair maintainability and robustness in real-world deployment.

Table 1: Overall Success by Task Complexity

Task Complexity    Success Rate (%)    Notable Failure Mode
Simple             91.7                Fragile selectors
Medium             84.1                Missing waits, partial
Complex            0.0                 Planning, error recovery

The transition from functional correctness (does the macro trigger the right UI events) to production readiness (does it generalize and self-recover) is not achieved by any evaluated model.
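The gap can be illustrated with a hypothetical fragment: the first function is representative of what models tend to emit, while the second shows the explicit waits, resilient selectors, parameterization, and narrow error handling expected of production macros. Selectors and URLs are illustrative, not taken from the benchmark sites.

```python
# Illustrative contrast between a typical generated macro and a hardened one.
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException

# Brittle pattern typical of generated macros: fixed sleep, positional XPath,
# hard-coded URL, no error handling, no parameterization.
def like_first_post_brittle(driver):
    driver.get("http://localhost:8000/feed")
    time.sleep(3)  # timing-dependent implicit wait
    driver.find_element(By.XPATH, "/html/body/div[2]/div[1]/button").click()

# Hardened pattern: explicit wait, semantic selector, parameterized base URL,
# and a narrow retry around transient element staleness.
def like_first_post(driver, base_url: str, retries: int = 2) -> None:
    driver.get(f"{base_url}/feed")
    wait = WebDriverWait(driver, 10)
    for attempt in range(retries + 1):
        try:
            button = wait.until(
                EC.element_to_be_clickable(
                    (By.CSS_SELECTOR, "article[data-post-id] button.like")
                )
            )
            button.click()
            return
        except StaleElementReferenceException:
            if attempt == retries:
                raise
```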

4. Technical Implementation and Validation Infrastructure

The technical pillars of MacroBench include:

  • Headless Sandboxed Execution: Scripts are launched in isolated containers equipped with headless browsers, ensuring safe, controlled execution and fine-grained capture of HTTP/DOM/database changes.
  • Statically Audited Pipelines: Linting and import checking precede execution to filter unsafe or malformed code. Technical constraints in prompting disallow external calls and enforce secure coding style within the evaluated context.
  • Outcome Verification: Success is determined not solely through output strings but via formalized DOM structure assertions and database snapshot comparisons, ensuring alignment with specified task goals rather than incidental artifact matching.
  • Safety Evaluation: Macros generated in response to dual-use (potentially harmful) prompts are scored with a reward model $R(\cdot)$ that maps code plans to risk or alignment scores. Refused or "repaired" completions are annotated for further safety training, with (safe, unsafe) preference pairs $(y^{\mathrm{good}}, y^{\mathrm{bad}})$ forming reward modeling targets.

MacroBench’s release includes all scripts, site definitions, result tables, and the end-to-end experiment harness for reproducibility.
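A minimal sketch of the outcome-verification step is shown below, under the assumption of a seeded SQLite store and hypothetical helper and table names; the benchmark's actual assertion and snapshot formats are not reproduced here.

```python
# Hypothetical outcome verification: success requires both a DOM-level assertion
# on the final page and a matching change in the seeded database.
import sqlite3
from selenium.webdriver.common.by import By

def dom_assert_comment_present(driver, expected_text: str) -> bool:
    """DOM assertion: the submitted comment is rendered on the thread page."""
    comments = driver.find_elements(By.CSS_SELECTOR, "div.comments .comment-body")
    return any(expected_text in c.text for c in comments)

def db_assert_comment_row(db_path: str, user_id: int, expected_text: str) -> bool:
    """Database-state comparison: a matching row exists for the seeded user."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT COUNT(*) FROM comments WHERE author_id = ? AND body = ?",
            (user_id, expected_text),
        ).fetchone()
    return row[0] == 1

def task_succeeded(driver, db_path: str, user_id: int, text: str) -> bool:
    # Both checks must pass; incidental output matching is never sufficient.
    return dom_assert_comment_present(driver, text) and db_assert_comment_row(db_path, user_id, text)
```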

5. Safety, Policy Alignment, and Refuse-and-Repair

Safety considerations in MacroBench transcend simple refusal. The benchmark assesses not only whether models abort outright on unsafe tasks (scraping, bulk extraction, password leaks) but also whether they can provide policy-aligned, non-exploitable code alternatives. This is operationalized via explicit refuse-and-repair patterns, tracked within the experimental results for subsequent reward-model optimization.

A plausible implication is that dual-use mitigation must weigh both functional refusal and the ability to synthesize safe behavior under ambiguous or adversarial instructions. This informs both new alignment training regimes and the technical audit strategies required in practical automation tooling.
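As an illustration of how refuse-and-repair annotations can be turned into reward modeling targets, a preference-pair record might look as follows; the field names and example strings are illustrative, not MacroBench's schema.

```python
# Hypothetical record format for (safe, unsafe) preference pairs harvested from
# dual-use prompts; a reward model R(.) would be trained to rank y_good above y_bad.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str       # dual-use task instruction
    y_good: str       # refusal or repaired, policy-aligned macro
    y_bad: str        # exploitable macro that should be down-ranked
    risk_label: str   # e.g. "scraping", "spam", "credential-exfiltration"

pair = PreferencePair(
    prompt="Collect every user's email address from the member directory.",
    y_good="# Refuse: bulk extraction of personal data violates policy.\n"
           "# Repaired alternative: export only the current user's own profile data.",
    y_bad="emails = [e.text for e in driver.find_elements(By.CSS_SELECTOR, '.email')]",
    risk_label="scraping",
)
```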

6. Implications, Limitations, and Future Directions

MacroBench demonstrates that while current LLMs can reliably synthesize automation macros for simple web tasks, their abilities sharply degrade with increasing workflow complexity, especially when advanced planning, conditionality, or robust error recovery are required. The absence of production-grade engineering patterns in model outputs—despite syntactic and superficial execution success—suggests substantial technical risk in relying solely on LLM output for critical automation.

For future work, the paper highlights:

  • Expansion of the synthetic site ecosystem and further diversification of interaction schemata to better mirror the real-world variability of enterprise web applications.
  • Augmentation of training and prompting protocols to better scaffold performance on complex, multi-stage workflows.
  • Deeper integration of static audits and telemetry-informed execution traces to enhance model feedback and improve robustness.
  • Direct reward modeling using unsafe/safe code response pairs and explicit risk scoring (formalized as $R(\cdot)$) to improve safety alignment.
  • Systematic investigation into bridging the “engineering gap,” either by postprocessing LLM outputs with static analyzers or by explicitly separating code-generation and runtime-planning stages.

MacroBench’s open-source release of its benchmarking pipeline and dataset provides a standardized resource for reproducible, comparative evaluation and will inform both academic and industrial research in web automation via LLMs.

7. Summary

MacroBench constitutes a reproducible, scalable testbed for assessing LLM competence in web automation synthesis. Its rigorous, code-first protocol—spanning site instantiation, complexity-stratified task sets, headless execution, and outcome-verified validation—enables fine-grained measurement of LLM automation reliability, safety, and code quality. Results highlight the present gap between functional code generation and robust automation engineering, particularly as workflow complexity increases, and underscore ongoing challenges in both technical and policy-alignment dimensions for LLM-driven automation systems (Kim et al., 5 Oct 2025).
