XBOW Benchmark: Autonomous Web Security Testing
- XBOW Benchmark is a comprehensive set of 104 web security challenges for rigorously assessing autonomous penetration testing frameworks, including those built on multi-agent systems.
- Evaluations on the benchmark, such as that of the MAPTA system, employ a multi-agent architecture of coordinator, sandbox, and validation agents to balance exploit execution against resource consumption in controlled Docker environments.
- Performance analysis shows high success rates for SSRF and injection attacks, while exposing significant challenges in handling blind SQL injection and XSS vulnerabilities.
The XBOW benchmark is a comprehensive set of 104 web security challenges designed to rigorously evaluate autonomous penetration testing frameworks, particularly those leveraging multi-agent architectures and LLMs. Used in recent research to measure the capabilities of the MAPTA system, XBOW encompasses a diverse array of vulnerability types and encapsulates the complexities of real-world web application security assessment. Its structure, methodological rigor, and fine-grained cost/performance analytics make it a pivotal tool for advancing the field of automated security auditing.
1. Structure and Composition of the XBOW Benchmark
The XBOW benchmark consists of 104 distinct web security challenges, each constructed to represent a specific class of web vulnerability. The curated challenges cover a spectrum including server-side request forgery (SSRF), security misconfiguration, broken authorization, various forms of injection (e.g., server-side template injection, SQL injection, command injection), cross-site scripting (XSS), and blind SQL injection. Each challenge provides a controlled scenario suitable for end-to-end exploit validation, thereby enabling the assessment of candidate systems' practical exploitability rather than mere theoretical detection.
The benchmark’s design ensures reproducibility and state persistence by using isolated, per-job Docker containers throughout the testing lifecycle. This enables retention of authentication artifacts, session states, and sequential enumeration within each challenge execution.
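A minimal sketch of this per-job isolation pattern follows, assuming the `docker` Python SDK; the image name, job identifiers, and harness structure are illustrative assumptions, not details from the source:

```python
import docker

def run_challenge(image, job_id):
    """Launch an isolated, per-job container so authentication artifacts,
    session state, and enumeration results persist for this job only."""
    client = docker.from_env()
    return client.containers.run(
        image,
        name=f"xbow-job-{job_id}",  # one container per challenge execution
        detach=True,                # keep it running for the job's lifetime
        remove=False,               # retain state until validation finishes
    )

# Hypothetical usage: a fresh container per run, torn down only after
# end-to-end exploit validation completes.
job = run_challenge("xbow/challenge-042", job_id="a1b2c3")
# ... drive the exploit against the container's network endpoint ...
job.stop()
job.remove()
```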
2. Evaluation Methodology and Agent Architecture
MAPTA’s evaluation on the XBOW benchmark exemplifies a multi-agent approach comprising three key agent types, sketched as minimal interfaces after this list:
- Coordinator Agent performs strategic orchestration, dynamically selecting tools and planning based on early observations.
- Sandbox Agents execute commands—either shell commands or Python routines—in a containerized environment, maintaining state across complex exploitation sequences.
- Validation Agent conducts end-to-end execution of proof-of-concept exploits, reporting only those vulnerabilities that are concretely exploitable and filtering theoretical false positives.
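The division of labor might be expressed as three narrow interfaces. These class names and method signatures are illustrative assumptions rather than MAPTA’s published API; the orchestration sketch later in this section builds on them:

```python
from typing import Protocol

class Coordinator(Protocol):
    """Strategic orchestration: selects tools and plans from observations."""
    def plan(self, challenge): ...
    def execute(self, action): ...      # direct command execution
    def observe(self, result): ...      # fold new intelligence into the plan

class SandboxAgent(Protocol):
    """Runs shell commands or Python routines inside the per-job container,
    preserving state across multi-step exploitation sequences."""
    def execute(self, action): ...

class Validator(Protocol):
    """Executes proof-of-concept exploits end to end; reports only
    concretely exploitable findings, filtering theoretical false positives."""
    def confirmed(self, challenge) -> bool: ...
    def proof_of_concept(self, challenge): ...
```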
The system alternates between delegating tasks to the dedicated “sandbox_agent” and executing commands directly, adapting to intelligence gathered through ongoing interaction. The orchestration is formulated as an optimization problem that maximizes a utility function weighing success probability against resource consumption.
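The source does not reproduce the exact objective; one plausible form, with $\pi$ an orchestration policy, $C(\pi)$ its expected resource cost, and $\lambda$ a trade-off weight (all three symbols introduced here for illustration only), is:

$$
\max_{\pi} \; U(\pi) \;=\; \Pr[\text{success} \mid \pi] \;-\; \lambda \, C(\pi)
$$

where $C(\pi)$ aggregates monetary cost, token usage, and elapsed time. Under this reading, early stopping corresponds to abandoning any trajectory whose accumulated cost makes a positive utility unreachable.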
Early-stopping thresholds, informed by empirical resource-efficiency data, terminate an attempt once it exceeds predetermined tool-call, cost, or time bounds without making progress.
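A minimal sketch of how these bounds might gate the orchestration loop, using the thresholds reported in Section 4 (~40 tool calls, \$0.30, 300 seconds); the delegation flag and cost accounting are hypothetical, not MAPTA’s actual interface:

```python
import time

# Thresholds as reported for the XBOW evaluation (see Section 4).
MAX_TOOL_CALLS = 40
MAX_COST_USD = 0.30
MAX_SECONDS = 300

def orchestrate(coordinator, sandbox_agent, validator, challenge):
    """Alternate between direct execution and delegation to the sandbox
    agent, stopping early once any resource bound is exceeded."""
    start = time.monotonic()
    tool_calls, cost = 0, 0.0
    while not validator.confirmed(challenge):
        if (tool_calls >= MAX_TOOL_CALLS
                or cost >= MAX_COST_USD
                or time.monotonic() - start >= MAX_SECONDS):
            return None                            # early stop: budget spent
        action = coordinator.plan(challenge)       # strategic orchestration
        if action.delegate:                        # hypothetical flag on actions
            result = sandbox_agent.execute(action) # stateful shell/Python step
        else:
            result = coordinator.execute(action)   # direct command
        tool_calls += 1
        cost += result.cost_usd                    # hypothetical cost accounting
        coordinator.observe(result)                # adapt to gathered intel
    return validator.proof_of_concept(challenge)   # only exploitable findings
```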
3. Performance Analysis by Vulnerability Type
MAPTA's results on the XBOW benchmark highlight the disparity in success rates across categories:
| Vulnerability Type | Success Rate (%) |
|---|---|
| SSRF / Misconfiguration | 100 |
| Broken Authorization | 83 |
| Server-Side Template Injection | 85 |
| SQL Injection | 83 |
| Command Injection | 75 |
| XSS | 57 |
| Blind SQL Injection | 0 |
The highest success rates occur in SSRF, misconfiguration, and several injection classes. Blind SQL injection, by contrast, yields a 0% success rate, owing to the inherent difficulty of detecting timing-based or otherwise non-obvious response discrepancies. XSS yields moderate success (57%), frequently hampered by the subtleties of DOM-based payload crafting and client-side interaction modeling.
4. Cost Efficiency and Resource Utilization
Rigorous cost and resource metrics were tracked across the full XBOW evaluation:
| Metric | Value |
|---|---|
| Total evaluation cost | \$21.38 |
| Median cost (success) | \$0.073 |
| Median cost (failure) | \$0.357 |
| Early-stopping tool-call threshold | ~40 calls |
| Early-stopping cost threshold | \$0.30 |
| Early-stopping time threshold | 300 seconds |
Pearson correlation analysis shows a strong negative relationship between resource consumption and exploitation success: tool usage, cost, token usage, and time spent are all inversely correlated with positive outcomes. This empirically supports the use of aggressive early stopping in real-world deployments.
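As an illustration of the reported analysis (the per-challenge data are not reproduced in this summary, so the numbers below are invented placeholders), the correlation between tool calls and binary success could be computed with `scipy`:

```python
from scipy.stats import pearsonr

# Hypothetical per-challenge records: resource use vs. binary success.
tool_calls = [12, 55, 8, 41, 73, 15]
success    = [1,  0,  1, 0,  0,  1]

r, p_value = pearsonr(tool_calls, success)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# A negative r reproduces the reported pattern: heavier resource
# consumption co-occurs with failed exploitation attempts.
```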
5. Identified Challenges and Limitations
Certain vulnerability types within XBOW prove more resistant to autonomous exploitation:
- XSS Attacks: MAPTA achieves 57% success, with degradation attributed to the complexity of DOM manipulation and client-side variability.
- Blind SQL Injection: The system records no successes; the absence of explicit response cues for timing-based payloads is the methodological bottleneck.
A plausible implication is that advancing payload generation and introducing feedback-driven exploration, such as enhanced timing analysis or alternate probing, could address these deficiencies.
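One concrete form such timing analysis could take is comparing response latencies for a benign probe and a time-delay payload. A minimal sketch, assuming a MySQL-style SLEEP() primitive and a hypothetical target URL; in practice many repeated trials would be needed to separate the induced delay from network jitter:

```python
import time
import requests

TARGET = "http://challenge.local/search"  # hypothetical challenge endpoint

def median_latency(payload, trials=5):
    """Median response time for a given injected value, to damp jitter."""
    times = []
    for _ in range(trials):
        start = time.monotonic()
        requests.get(TARGET, params={"q": payload}, timeout=15)
        times.append(time.monotonic() - start)
    return sorted(times)[len(times) // 2]

baseline = median_latency("abc")
delayed = median_latency("abc' AND SLEEP(5)-- -")

# In blind injection the induced delay is the only observable cue,
# exactly the non-obvious discrepancy the benchmark results highlight.
if delayed - baseline > 4.0:
    print("Timing differential detected: likely blind SQL injection")
```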
6. Real-World Relevance and Deployment Implications
Beyond the controlled XBOW scenarios, MAPTA’s methodology was applied in whitebox assessments of popular open-source projects (8K-70K GitHub stars), resulting in the discovery of 19 vulnerabilities across 10 applications. Fourteen findings were severe (RCEs, command injection, secret exposure, arbitrary file writes), with a mean assessment cost of \$3.67. Notably, 10 issues entered the CVE review process, confirming tangible impact and responsible disclosure.
MAPTA’s use of containerized environments and its proven cost-effectiveness position it for scalable, continuous deployment in resource-constrained settings. The XBOW benchmark thus serves as both a thorough evaluation platform and a model for practical, autonomous security workflows in real-world software development pipelines.
7. Significance for Automated Security Auditing
The XBOW benchmark operationalizes rigorous, reproducible evaluation of autonomous penetration testing agents, facilitating statistical analysis of resource efficiency and exploitability over diverse vulnerability classes. Its integration with methods such as MAPTA’s multi-agent approach demonstrates not only the feasibility but also the practical implications of AI-driven security assessment. Areas of remaining challenge point directly to future research directions, emphasizing the need for more sophisticated payload crafting and probabilistic reasoning over opaque or delayed response surfaces. Through its granular performance metrics and cost analysis, XBOW advances the discourse on scalable, trustworthy autonomous security for modern web applications.