- The paper introduces an LLM-driven pipeline that automatically generates comprehensive security benchmarks from novel backend scenarios.
- The methodology integrates scenario synthesis, zero-shot reference implementations, iterative test refinement, and exploit validation, yielding vulnerability coverage that matches or exceeds manually authored benchmarks.
- Experiments show the generated exploits are more sensitive than expert-written ones in roughly 39% of scenarios and uncover additional CWEs in about 21%, while benchmark generation remains fast, reproducible, and low-cost.
AutoBaxBuilder: Automated Code Security Benchmark Generation
Overview and Motivation
The proliferation of LLM-driven code generation in software engineering introduces acute security risks, as models routinely synthesize application backends containing vulnerabilities that are non-trivial to detect and mitigate. Previous efforts to evaluate code security, such as BAXBENCH, rely heavily on benchmarks manually constructed by security experts, which brings serious limitations: (i) benchmarks risk contaminating LLM training data, (ii) extending them to new tasks requires repetitive manual effort, and (iii) benchmarks quickly become obsolete as model capabilities advance. "AutoBaxBuilder: Bootstrapping Code Security Benchmarking" (2512.21132) aims to address these fundamental issues with an LLM-orchestrated pipeline that autonomously generates comprehensive, realistic code security benchmarks.
System Architecture and Pipeline
AutoBaxBuilder constructs benchmarking tasks consisting of new application scenarios, functional test suites, and end-to-end exploits, leveraging an orchestration LLM together with multiple solution LLMs for iterative generation and refinement. The pipeline comprises the following stages (a minimal orchestration sketch follows the list):
- Scenario Synthesis: The orchestration LLM is prompted to produce novel backend scenarios with well-defined attack surfaces, ensuring non-duplication with pre-existing scenarios via explicit novelty checks. Scenario specifications include OpenAPI descriptions to enforce precision and reduce ambiguity.
- Reference Implementations: Diverse solution LLMs are tasked to generate zero-shot implementations for each scenario, serving as candidates for test validation and exploit development.
- Functional Test Generation and Refinement: The pipeline decomposes scenario requirements into functional tests that explicitly align with the OpenAPI specification. Iterative solution and test refinement cycles help ensure that tests neither over-specify behaviors nor make unwarranted implementation assumptions.
- Exploit Generation and Validation: Vulnerability identification combines scenario-level and implementation-level analysis, pooling candidate vulnerabilities across known CWE classes. Exploit code is iteratively refined using execution traces, and a candidate exploit is retained only if it reliably distinguishes insecure implementations from patched ones, yielding coverage across multiple CWE categories.
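The orchestration code itself is not reproduced here; the following is a minimal sketch of how the four stages could be wired into a single loop. All helper names (propose_scenario, draft_tests, refine_tests, draft_exploits, patch, and the implement/passes/succeeds methods) are hypothetical illustrations, not the paper's actual API.

```python
# Hypothetical sketch of an AutoBaxBuilder-style generation loop; every helper
# name below is an assumption made for illustration, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class BenchmarkTask:
    scenario: str                                         # prose + OpenAPI spec
    implementations: list = field(default_factory=list)   # zero-shot solutions
    tests: list = field(default_factory=list)             # functional tests
    exploits: list = field(default_factory=list)          # validated exploits

def build_task(orchestrator, solvers, existing_scenarios, max_rounds=3):
    # 1) Scenario synthesis with an explicit novelty check against prior tasks.
    scenario = orchestrator.propose_scenario(avoid=existing_scenarios)

    # 2) Zero-shot reference implementations from diverse solution LLMs.
    impls = [solver.implement(scenario) for solver in solvers]

    # 3) Iterative test refinement: drop or rewrite tests that reject
    #    otherwise-correct implementations (over-specification).
    tests = orchestrator.draft_tests(scenario)
    for _ in range(max_rounds):
        failures = [(t, i) for t in tests for i in impls if not t.passes(i)]
        if not failures:
            break
        tests = orchestrator.refine_tests(scenario, tests, failures)

    # 4) Exploit validation: keep an exploit only if it succeeds against some
    #    insecure implementation and fails against a patched counterpart.
    exploits = []
    for candidate in orchestrator.draft_exploits(scenario, impls):
        hits_insecure = any(candidate.succeeds(i) for i in impls)
        patched = orchestrator.patch(scenario, candidate)
        if hits_insecure and not candidate.succeeds(patched):
            exploits.append(candidate)

    return BenchmarkTask(scenario, impls, tests, exploits)
```

The invariant mirrored here is the one described above: tests are refined until they accept correct implementations, and exploits count only when they separate insecure from patched code.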
Auxiliary mechanisms further improve reliability and reproducibility, including validation of syntactic and semantic constraints, pseudorandom flags that prevent exploits from overfitting to hard-coded secret values, and modular helper code generation; a sketch of the flag mechanism appears directly below.
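The paper describes pseudorandom flags only at a high level. A plausible reading, sketched below, is that each exploit run plants a freshly generated secret in the deployed backend, so an exploit counts as successful only if it exfiltrates that run's secret rather than replaying a memorized value; the start_backend and exploit callables are assumptions.

```python
# Hypothetical illustration of the pseudorandom-flag guard: the secret is
# regenerated for every run, so an exploit cannot pass by hard-coding a value.
import secrets

def run_exploit_with_flag(start_backend, exploit):
    flag = secrets.token_hex(16)           # fresh secret for this run only
    backend = start_backend(secret=flag)   # e.g. written to a protected file or env var
    try:
        leaked = exploit(backend.base_url)  # exploit returns whatever data it exfiltrated
        return flag in (leaked or "")       # success only if this run's flag leaked
    finally:
        backend.stop()
```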
Experimental Results and Evaluation
AutoBaxBuilder's validity was assessed in two core experiments:
- Benchmark Alignment: Tests and exploits generated by AutoBaxBuilder for BAXBENCH scenarios were compared against their human-authored counterparts. Functional test agreement was high (∼83.5%), with precision and recall both exceeding 81% (see the alignment-metric sketch after this list). Strikingly, AutoBaxBuilder outperformed the expert-written benchmark in vulnerability coverage: it uncovered additional CWEs in ∼21% of scenarios and constructed more sensitive exploit tests in ∼39% of cases.
- New Benchmark Generation: The system produced 40 fresh scenarios (AutoBaxBench), split into EASY, MEDIUM, and HARD subsets, with complexity scaling in endpoint count and exploit coverage. With scenario generation times averaging 2 hours and API costs below $10 per task, AutoBaxBuilder massively accelerates benchmark construction compared to manual workflows.
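The agreement and precision/recall figures suggest a verdict-level comparison between the generated and human-authored test suites over a shared pool of implementations. The sketch below shows one way such metrics could be computed; treating the human verdicts as ground truth and encoding each verdict as a boolean are assumptions about the paper's setup.

```python
# Sketch of verdict-level alignment between generated and human-authored tests.
# Each entry is True if the test suite accepts an implementation, else False;
# the human suite is treated as ground truth (an assumption).
def alignment_metrics(generated, human):
    pairs = list(zip(generated, human))
    agreement = sum(g == h for g, h in pairs) / len(pairs)
    true_pos = sum(g and h for g, h in pairs)
    precision = true_pos / max(1, sum(g for g, _ in pairs))
    recall = true_pos / max(1, sum(h for _, h in pairs))
    return agreement, precision, recall

# Toy example: verdicts of both suites across five candidate implementations.
print(alignment_metrics([True, False, True, True, False],
                        [True, False, True, False, False]))
```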
LLM evaluation on AutoBaxBench underscores how far the problem is from solved: even SOTA models (CLAUDE-4.5 SONNET, GROK 4, GEMINI 2.5 PRO, GPT-4O, QWEN2.5 CODER) achieved sec_pass@1 rates below 36% on average, and below 9% on the HARD split, confirming the substantial gap in synthesizing secure code at scale.
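In BAXBENCH-style evaluation, sec_pass@1 counts a generated solution only if it is functionally correct and withstands every exploit. A minimal estimator under that reading, with one sample per task and hypothetical passes_tests/resists_exploits callables, looks like:

```python
# Minimal sec_pass@1 estimator: a sample counts only if it passes all
# functional tests AND no exploit succeeds against it. The two callables
# are hypothetical stand-ins for the benchmark harness.
def sec_pass_at_1(samples, passes_tests, resists_exploits):
    secure = [passes_tests(s) and resists_exploits(s) for s in samples]
    return sum(secure) / len(samples)
```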
Contradictions, Limitations, and Claims
The paper makes several bold and empirically falsifiable claims:
- AutoBaxBuilder-generated security exploits are stricter and achieve higher coverage than human experts' benchmarks. This is substantiated by case studies where previously undetected CWEs (e.g., OS command injection) are surfaced via LLM-discovered exploit strategies; an illustrative exploit probe of this kind is sketched after this list.
- False positive exploits are rare, except for CWE-400 (Uncontrolled Resource Consumption), which is excluded from aggregate metrics due to unreliability in memory usage detection.
- Functional tests written by LLMs rival or surpass human-crafted tests, and AutoBaxBuilder can identify and correct errors in previous human-authored benchmarks.
- Nonetheless, some vulnerability classes (e.g., authentication flaws, resource exhaustion) remain challenging for fully autonomous exploit synthesis, with coverage and CWE classification subject to ambiguity and expert disagreement.
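As an illustration of the kind of end-to-end exploit the pipeline validates, the sketch below probes a backend for OS command injection (CWE-78) by injecting a marker command and checking whether its output is reflected. The /archive endpoint, the filename parameter, and the payload shape are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical end-to-end exploit probe for OS command injection (CWE-78).
# Endpoint, parameter name, and payload are illustrative assumptions.
import requests

MARKER = "BAXB_7f3a9c"

def os_injection_exploit(base_url: str) -> bool:
    payload = {"filename": f"notes.txt; echo {MARKER}"}
    resp = requests.post(f"{base_url}/archive", json=payload, timeout=10)
    # If the backend shells out with the unsanitized filename, the injected
    # command's output (the marker) leaks into the response body.
    return MARKER in resp.text
```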
Practical and Theoretical Implications
From a practical standpoint, AutoBaxBuilder permits rapid expansion and regeneration of security benchmarks tailored to emerging model families and evolving vulnerability classes, directly mitigating benchmark contamination and static obsolescence. The capacity to dynamically scale scenario difficulty supports longitudinal model evaluation as SOTA advances, challenging LLMs with progressively harder tasks as their coding competence grows.
Theoretically, the work demonstrates LLMs' meta-testing competency: not only can LLMs be tested, they can autonomously generate end-to-end tests and exploits whose coverage and granularity approach expert level, given precise prompting and orchestration. AutoBaxBuilder thus serves both as an evaluation instrument and as a generator of adversarial code synthesis contexts, suggesting future directions in LLM self-improvement via RL and adversarial training.
Future Directions
Several avenues for future work are identified:
- Domain Adaptation: Extending to non-web settings (CLI, ABIs, specialized frameworks) and to additional CWE classes beyond the current scope.
- Exploit Diversity and Robustness: Increasing attack vector diversity, mitigating exploit overfitting, and refining exploit success metrics for ambiguous classes (e.g., denial-of-service/resource exhaustion).
- Meta-Model Evaluation: Closing the benchmarking loop by leveraging self-generated benchmarks to evaluate solution LLMs in an unbiased manner, factoring out generation-induced performance inflation.
- Integration with RL Environments: Enabling long-horizon, on-the-fly benchmark construction for reinforcement learning agents engaged in secure code generation and automated vulnerability discovery.
Conclusion
AutoBaxBuilder (2512.21132) establishes a scalable, agentic pipeline for automated code security benchmark generation. Its empirical results highlight the potential of closely orchestrated LLMs to produce benchmarks with coverage, specificity, and rigor matching or exceeding manual engineering. The persistent security deficit of LLM-generated code in the face of stricter, broader benchmarks underscores critical limitations in current capabilities, warranting continued research in secure code synthesis, evaluation methodologies, and automated task generation for future AI systems.