
AutoBaxBench: Automated Code Security Benchmark

Updated 26 December 2025
  • AutoBaxBench is a fully automated and extensible benchmark suite that generates backend scenarios, functional tests, and exploits for evaluating LLMs.
  • It employs an advanced LLM-driven pipeline with stages for scenario invention, solution synthesis, and iterative test refinement to ensure both functionality and security.
  • The platform provides self-contained task triplets with quantified metrics (pass@1, sec_pass@1) and reliably detects vulnerabilities such as SQL Injection, XSS, and path traversal.

AutoBaxBench is a fully automated, extensible benchmark suite for evaluating the correctness and security of backend code generation by LLMs. Developed to replace the manual, expert-driven workflow of BAXBench, AutoBaxBench leverages advanced LLMs to generate diverse backend tasks—complete with functional tests and vulnerability exploits—so that the code security performance of current and future models can be assessed continuously, at scale, and without contamination from training data. The benchmark emphasizes full-stack automation: from scenario invention and validation, through test and exploit synthesis, to self-contained performance evaluation across frameworks, languages, and exploit classes (Vero et al., 17 Feb 2025, Arx et al., 24 Dec 2025).

1. Automated Benchmark Generation Pipeline

AutoBaxBench is created via AutoBaxBuilder, an LLM-driven orchestration system with three main stages (scenario generation, functional test and exploit suite creation, and solution/test refinement), realized as the five steps below; a small sketch of the refinement acceptance criterion follows the list.

  1. Scenario Generation: The master LLM generates novel backend application scenarios via prompting, filtered by automated novelty checks. Each scenario includes an OpenAPI 3.0.3 specification (endpoints, requests, responses, error model) and a plain-English description.
  2. Reference Solution Synthesis: Multiple “solution” LLMs implement the scenario in diverse real-world frameworks (e.g., Python FastAPI, Node Express), providing ground-truth candidates for calibration.
  3. Test and Exploit Generation: Automated prompts analyze the OpenAPI spec to extract functional requirements and generate deterministic Python-based tests (e.g., verifying exact SVG outputs or database state). Security exploits are derived by combining scenario-level CWE targets (e.g., SQL injection, XSS, path traversal) with concrete vulnerabilities observed in individual solutions; high-level exploit strategies are first checked for applicability against the spec.
  4. Solution–Test Refinement: All test scenarios and exploits are run in Dockerized environments. Tests and exploits are iteratively rejected or refined if they are too strict (no solution passes) or too lenient (all solutions pass), ensuring that they discriminate effectively.
  5. Exploit Soundness Verification: For each exploit, both “hardened” (secure) and “weakened” (insecure) variants of a reference solution are tested. Automated flags confirm true positives (TP), false negatives (FN), etc. Refinement continues until all exploits reliably detect their target vulnerabilities and do not flag false positives (Arx et al., 24 Dec 2025).
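
The acceptance criterion in step 4 is simple enough to state directly in code. The sketch below illustrates only that rule; the function name and the boolean-list representation of per-solution results are assumptions for illustration, not AutoBaxBuilder internals.

```python
def test_is_discriminative(pass_results: list[bool]) -> bool:
    """Refinement rule from step 4: keep a generated functional test only if
    it is neither too strict (no reference solution passes it) nor too
    lenient (every reference solution passes it)."""
    return any(pass_results) and not all(pass_results)


# A test passed by 2 of 3 reference solutions discriminates and is kept;
# tests passed by all or by none are sent back for refinement.
assert test_is_discriminative([True, True, False])
assert not test_is_discriminative([True, True, True])
assert not test_is_discriminative([False, False, False])
```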

This pipeline outputs self-contained task triplets: scenario metadata (title, English spec, OpenAPI schema), a Python functional test suite, and a Python exploit suite. New tasks can be generated autonomously on demand, with per-task generation time under 2 hours and cloud API cost under $10.

2. Task Structure and Test Suite Format

Each AutoBaxBench task is format-homogeneous and consists of:

  • ScenarioMetadata: Title, descriptive paragraph, OpenAPI 3.0.3 block defining endpoints, schemas, error codes.
  • FunctionalTests: A Python module with functions of signature def test_<name>(app: AppInstance) -> bool, including shared HTTP and DB utilities (a hedged example of this format follows the list).
  • SecurityExploits: A Python module with functions for black-box and white-box-like testing: e.g., HTTP path traversal, XSS payload delivery, database/file system inspection for violation markers.
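
To make this format concrete, here is a hedged sketch of what a single functional test might look like. Only the signature def test_<name>(app: AppInstance) -> bool is taken from the benchmark description; the AppInstance fields, the /items endpoint, and the use of the requests library are illustrative assumptions.

```python
from dataclasses import dataclass

import requests  # stands in for the shared HTTP utilities real tasks provide


@dataclass
class AppInstance:
    # Placeholder: the real AppInstance presumably exposes connection details
    # for the containerized backend; here we assume only a base URL.
    base_url: str


def test_create_and_fetch_item(app: AppInstance) -> bool:
    """Illustrative functional test: POST an item, then check that GET
    returns it with the expected status codes and unchanged content.
    The /items endpoint is a hypothetical example, not a real task."""
    created = requests.post(f"{app.base_url}/items", json={"name": "pencil"})
    if created.status_code != 201:
        return False
    item_id = created.json().get("id")
    fetched = requests.get(f"{app.base_url}/items/{item_id}")
    return fetched.status_code == 200 and fetched.json().get("name") == "pencil"
```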

Tasks are distributed into three difficulty clusters according to the number of endpoints:

  • EASY: 1 endpoint (10 tasks)
  • MEDIUM: 3 endpoints (20 tasks)
  • HARD: 5 endpoints (10 tasks)

Task diversity and novelty are systematically enforced by automated scenario vetting, and the CWEs targeted cover 7 of the MITRE Top 25, including but not limited to CWE-79 (XSS), CWE-89 (SQL Injection), CWE-22 (Path Traversal), CWE-400 (Resource Consumption), and CWE-78 (OS Injection).

3. Performance Metrics and Evaluation Protocol

The evaluation protocol uses two primary empirical metrics per task:

  • pass@1: The fraction of generated solutions that pass all functional tests, ensuring functional correctness. Generalizes to pass@k for k samples.
  • sec_pass@1: The fraction of solutions that are both functionally correct and resistant to all automatic exploits. Directly analogous to pass@1, except correctness requires not only functional pass but also exploit resistance.

For sec_pass@k, the formula is:

\text{sec\_pass@}k = \frac{1}{|T|} \sum_{t \in T} \left( 1 - \frac{\binom{n_t - s_t}{k}}{\binom{n_t}{k}} \right)

where $s_t$ is the number of secure candidates for task $t$ among $n_t$ samples, and $|T|$ is the total number of tasks (Vero et al., 17 Feb 2025).
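
The estimator can be computed directly from per-task counts. A minimal implementation, assuming the per-task values $n_t$ and $s_t$ have already been measured:

```python
from math import comb


def sec_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Unbiased sec_pass@k estimator averaged over tasks.

    counts holds one (n_t, s_t) pair per task, where n_t is the number of
    sampled solutions for task t and s_t is the number of those that are both
    functionally correct and exploit-resistant (secure).
    """
    total = 0.0
    for n_t, s_t in counts:
        # Probability that at least one of k sampled candidates is secure.
        total += 1.0 - comb(n_t - s_t, k) / comb(n_t, k)
    return total / len(counts)


# Example: two tasks with 10 samples each, 3 and 0 of them secure.
print(sec_pass_at_k([(10, 3), (10, 0)], k=1))  # 0.15
```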

The test harness deploys each candidate backend in an isolated Docker container, runs the functional test suite, and, only if the solution is functionally correct, runs the full exploit suite. The container environment exposes the relevant filesystem and environment state (e.g., database files, secret environment variables) for white-box exploitability checks and blocks all outbound networking.
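
A minimal sketch of this gating logic, with the Docker plumbing abstracted away; the callables stand in for the containerized functional-test and exploit runs and are not a real harness API:

```python
from typing import Callable, Sequence


def evaluate_solution(
    functional_tests: Sequence[Callable[[], bool]],
    exploits: Sequence[Callable[[], bool]],  # each returns True if the exploit succeeds
) -> tuple[bool, bool]:
    """Return (functionally_correct, secure) for one deployed candidate.

    Security is only assessed for solutions that already pass every
    functional test; a functionally incorrect solution is never counted
    toward sec_pass@k.
    """
    functionally_correct = all(test() for test in functional_tests)
    if not functionally_correct:
        return False, False
    secure = not any(exploit() for exploit in exploits)
    return True, secure


# Example with trivial stand-ins: all tests pass, no exploit succeeds.
print(evaluate_solution([lambda: True], [lambda: False]))  # (True, True)
```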

4. Empirical Evaluation

Extensive empirical evaluation demonstrates AutoBaxBench’s ability to discriminate both functional correctness and exploit resistance across LLMs and backend frameworks:

  • Alignment to Manual Benchmarks: Functional test suites generated by AutoBaxBuilder achieve 81.6% precision and 81.1% recall against expert-authored BAXBench tests; 78% of vulnerabilities identified by BAXBench are also found by AutoBaxBuilder, and additional vulnerabilities are discovered in 21% of scenarios.
  • Sec_pass@1 Lower Bounds: More comprehensive AutoBaxBuilder-generated exploits cause sec_pass@1 scores to be 5–15 points lower than BAXBench baselines, which tightens lower bounds on LLM security.
  • Task Complexity and Model Performance: As scenario complexity rises (quantified by endpoint count), both pass@1 and sec_pass@1 decrease; on HARD tasks, the best LLMs achieve <9% sec_pass@1.
  • Cost and Scalability: Generating 40 new AutoBaxBench tasks costs less than $160 (average <$4 per task for API tokens); typical scenario generation and vetting completes in under 2 hours per task (Arx et al., 24 Dec 2025).

5. Exploit Typology and Soundness Checking

Security exploit generation in AutoBaxBench is tightly coupled to both the scenario and the candidate implementation. The orchestrator issues targeted exploits such as the following (a hedged sketch of each kind appears after this list):

  • Black-box attacks: Pure HTTP probes (e.g., sending SQLi/XSS/path traversal vectors and parsing response markers).
  • White-box-like attacks: After stopping the service, inspecting container filesystem or database artifacts indicative of exploit success, e.g., creation of “/danger.txt” as evidence for code injection, or extraction of plaintext credentials from an SQLite database.
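
For illustration, the sketch below shows one exploit of each kind in the Python exploit-module style; the endpoints, payloads, and file paths are hypothetical, and AppInstance is a self-contained placeholder rather than the actual benchmark type.

```python
from dataclasses import dataclass
from pathlib import Path

import requests


@dataclass
class AppInstance:
    base_url: str   # HTTP address of the deployed backend (placeholder)
    data_dir: Path  # mounted container filesystem for white-box inspection


def exploit_path_traversal(app: AppInstance) -> bool:
    """Black-box probe: request a file outside the served directory and check
    whether server internals leak back. Returns True if the exploit succeeds.
    The /files endpoint is a hypothetical example."""
    resp = requests.get(f"{app.base_url}/files",
                        params={"name": "../../etc/passwd"})
    return resp.status_code == 200 and "root:" in resp.text


def exploit_os_injection_marker(app: AppInstance) -> bool:
    """White-box-like check: drive a hypothetical endpoint with a shell
    injection payload, then look for the marker file the payload would have
    created inside the container filesystem (the real harness stops the
    service before inspecting artifacts)."""
    requests.post(f"{app.base_url}/convert",
                  json={"filename": "x; touch /danger.txt"})
    return (app.data_dir / "danger.txt").exists()
```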

Exploit efficacy is verified by ensuring that each exploit reliably succeeds on deliberately “weakened” (vulnerable) solutions and fails on “hardened” (secured) ones. This is enforced via a four-flag system—true positive (TP), false negative (FN), false positive (FP), true negative (TN)—for every exploit–solution pair.
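
A minimal sketch of this soundness check, assuming each exploit has been run against a weakened (intentionally vulnerable) and a hardened (patched) variant of a reference solution:

```python
def classify_exploit(succeeds_on_weakened: bool, succeeds_on_hardened: bool) -> str:
    """Four-flag verdict for one exploit-solution pair: a sound exploit fires
    on the weakened variant (TP) and stays silent on the hardened one (TN);
    FN or FP verdicts send the exploit back for refinement."""
    weak_flag = "TP" if succeeds_on_weakened else "FN"
    hard_flag = "FP" if succeeds_on_hardened else "TN"
    return f"{weak_flag}/{hard_flag}"


def exploit_is_sound(succeeds_on_weakened: bool, succeeds_on_hardened: bool) -> bool:
    return classify_exploit(succeeds_on_weakened, succeeds_on_hardened) == "TP/TN"


print(exploit_is_sound(True, False))   # True: detects the planted flaw, no false alarm
print(exploit_is_sound(False, False))  # False: misses the planted flaw (FN)
```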

6. Comparison to Manual Pipelines and Extension Potential

AutoBaxBench eliminates key bottlenecks of manual code security benchmark development:

  • Contamination Avoidance: No code or scenario in AutoBaxBench is reused or excerpted from real-world code; all are LLM-invented and auto-checked for novelty.
  • Rapid Expansion: Whereas handcrafting one scenario takes 3 hours (and may miss CWEs), AutoBaxBench delivers a new task, with validated tests and CWE coverage, in under 2 hours, with expert supervision required only for final spot checks.
  • Broader Vulnerability and Functional Coverage: Utilization of a list of target CWEs as well as parameterized scenario difficulty ensures growing diversity and difficulty as model capabilities improve.
  • Integration: AutoBaxBench scenarios integrate seamlessly with the BAXBench harness, evaluating LLMs in 14 frameworks (across Go, Python, JavaScript, PHP, Ruby, Rust) via consistent evaluation flows.

The pipeline and all scenarios/tests/exploits are open-sourced, enabling continuous, self-sustaining stress-testing with minimal human intervention, and supporting straightforward future extensions (new CWEs, frameworks, languages) (Arx et al., 24 Dec 2025).

7. Usage and Experiment Reproducibility

AutoBaxBench and the AutoBaxBuilder library are publicly available. Key usage and reproduction steps are listed below, followed by an illustrative driver sketch:

  • Install requirements: Python 3.10+, Docker, and an LLM API key (OpenAI, Anthropic, etc.).
  • Install AutoBaxBuilder via pip, export API keys, and generate new task batches with the CLI.
  • Use autobaxbuilder serve to deploy tasks (containerized backend), autobaxbuilder test to run functional tests, and autobaxbuilder exploit for vulnerability checking.
  • Benchmark LLMs on either the full or any subset of AutoBaxBench by pointing the BAXBench evaluation harness at the scenario folder and executing full evaluation runs.
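
An illustrative driver for these steps is sketched below. Only the subcommand names (serve, test, exploit) come from the usage notes above; the positional argument, the lifecycle assumption that serve keeps running until terminated, and the overall invocation pattern are guesses rather than documented CLI behavior.

```python
import subprocess

# Hypothetical reproduction driver; every argument and lifecycle detail here
# is an assumption for illustration, not the documented AutoBaxBuilder CLI.
SCENARIO_DIR = "scenarios/example_task"  # hypothetical path to one generated task

server = subprocess.Popen(["autobaxbuilder", "serve", SCENARIO_DIR])
try:
    subprocess.run(["autobaxbuilder", "test", SCENARIO_DIR], check=True)
    subprocess.run(["autobaxbuilder", "exploit", SCENARIO_DIR], check=True)
finally:
    server.terminate()
    server.wait()
```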

This process yields functionally and security-verified performance metrics for arbitrary code-generation LLMs on robust, contamination-free, and up-to-date backend security benchmarks (Arx et al., 24 Dec 2025).
