AutoBaxBuilder: Secure Code & Materials Discovery
- AutoBaxBuilder names two frameworks: one automates secure code benchmarking via LLMs, the other enables targeted materials discovery using Bayesian acquisition techniques, exemplified by REST API security and nanoparticle synthesis respectively.
- It implements iterative, LLM-driven generation and refinement of test scenarios, functional validations, and exploit constructions to ensure contamination-resistant and cost-efficient security evaluations.
- The materials discovery component leverages Gaussian process surrogates with the parameter-free MeanBAX, InfoBAX, and SwitchBAX acquisition strategies to achieve a 1.5–2x acceleration in identifying true experimental targets.
AutoBaxBuilder denotes two separate, advanced frameworks in contemporary computational research: one for automated bootstrapping of code security benchmarks using LLMs, and another for targeted materials discovery in discrete design spaces via Bayesian acquisition techniques. Both share the name and the principle of algorithmic workflow automation but operate in markedly different domains: secure code generation and experimental design, respectively.
1. Motivation and Fundamental Objectives
In code security evaluation, the proliferation of LLM-generated code necessitates reliable, unbiased, and scalable mechanisms for assessing both correctness and vulnerability. Manual benchmarks exhibit core shortcomings: frequent contamination of training data, inability to scale coverage as rapidly as novel tasks or vulnerability classes emerge, and a progressive deficit in benchmark difficulty as LLMs improve. The primary aim of AutoBaxBuilder is to generate realistic, security-critical programming scenarios from scratch, deliver precise coverage-complete functional tests, synthesize end-to-end exploits mapped to Common Weakness Enumeration (CWE) classes, and keep generation throughput high with minimal computational costs (Arx et al., 24 Dec 2025).
In materials discovery, the challenge is efficient navigation of large, multi-property experimental design spaces—moving beyond simple property maximization to the targeted isolation of design points that satisfy complex, user-defined criteria. The objective is to transcribe arbitrary filter functions over candidate points into acquisition strategies that drive sequential experiment selection, optimizing for coverage of user-specified target regions in property space (Chitturi et al., 2023).
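To make the notion of an arbitrary filter concrete, here is a minimal sketch of a mask-producing filter over candidate property vectors; the property names and thresholds are hypothetical illustrations, not taken from the source:

```python
import numpy as np

def target_filter(properties: np.ndarray) -> np.ndarray:
    """Hypothetical user-defined filter: boolean mask over candidate
    design points whose (band_gap, conductivity) pair falls in the
    desired region of property space."""
    band_gap, conductivity = properties[:, 0], properties[:, 1]
    return (band_gap > 1.0) & (band_gap < 2.0) & (conductivity > 0.5)
```

Any such mask-producing routine, however irregular the region it carves out, can drive the acquisition strategies described in Section 6.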
2. AutoBaxBuilder for Code Security Benchmarking
The AutoBaxBuilder framework for secure code evaluation is structured as a robust, multi-stage pipeline orchestrated by LLMs:
- Stage 1: Scenario and Solution Synthesis
- The system prompts a "master" LLM (M) to generate novel backend scenarios, including OpenAPI specifications and reference texts. A diversity of implementations is then sampled zero-shot from a set of solution LLMs (M′).
- Stage 2: Functional Test Generation and Refinement
- The scenario undergoes automatic requirement analysis, leading to the drafting of functional test suites. These are iteratively refined alongside the solutions to address both framework- and logic-level errors.
- Stage 3: Security Probing and Exploit Construction
- LLMs identify plausible CWEs in both scenario descriptions and solutions. For each potential vulnerability, a high-level attack strategy is synthesized, then implemented as an executable exploit, with iterative refinement to guarantee that exploits are both necessary (true positives on buggy code) and sufficient (true negatives on patched code).
Central to the process are algorithms for iterative self-critique and refinement, tight coupling of generation and validation steps, and reliance on both internal and external validators (parsing schemas, regression against external test harnesses) to ensure artifact plausibility (Arx et al., 24 Dec 2025).
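The paper describes this pipeline at the level of stages rather than code; the following is a minimal sketch of the generate-validate-refine control flow, where every callable argument (gen_scenario, sample_solution, draft_tests, run_validators, refine, build_exploits) is a hypothetical stand-in for the corresponding LLM prompt or external validator:

```python
def build_task(gen_scenario, sample_solution, draft_tests,
               run_validators, refine, build_exploits,
               master_llm, solution_llms, max_rounds=5):
    """Sketch of the three-stage loop. All callables are hypothetical
    stand-ins for LLM prompts and internal/external validators."""
    # Stage 1: novel backend scenario (OpenAPI spec + reference text),
    # then zero-shot solutions sampled from a diverse set of models.
    scenario = gen_scenario(master_llm)
    solutions = [sample_solution(m, scenario) for m in solution_llms]

    # Stage 2: draft functional tests, then refine tests and solutions
    # together until the validators pass or the budget runs out.
    tests = draft_tests(master_llm, scenario)
    for _ in range(max_rounds):
        verdicts = run_validators(scenario, solutions, tests)
        if all(verdicts):
            break
        tests, solutions = refine(master_llm, verdicts, tests, solutions)

    # Stage 3: CWE identification, attack planning, exploit synthesis.
    exploits = build_exploits(master_llm, scenario, solutions)
    return scenario, tests, exploits
```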
3. Automated Plausibility and Quality Control
AutoBaxBuilder implements a disciplined suite of plausibility checks at multiple pipeline stages:
- Specification Validity: All OpenAPI/YAML schemas are validated by external parsers, triggering immediate correction if necessary.
- Scenario Novelty: LLMs explicitly check for scenario uniqueness against existing benchmarks via comparison prompts.
- Test Plausibility: Functional coverage is accepted only if at least one reference solution passes all generated tests.
- Exploit Plausibility: Every exploit is evaluated on both hardened and weakened solution variants, mandating zero false positives and zero false negatives; failures invoke additional refinement cycles up to a computational budget.
This protocol enforces strict separation between functional correctness and security verification, and ensures that both are thoroughly validated before new tasks are released for benchmarking (Arx et al., 24 Dec 2025).
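The exploit-plausibility criterion translates directly into a small harness. A minimal sketch, assuming a hypothetical run_exploit(exploit, app) that deploys a variant and reports whether the attack succeeds, and a hypothetical refine step for failed exploits:

```python
def exploit_is_plausible(run_exploit, exploit, hardened_app, weakened_app):
    """Keep an exploit only if it is selective: it must trigger on the
    vulnerable variant (no false negatives) and must NOT trigger on the
    patched variant (no false positives)."""
    return (run_exploit(exploit, weakened_app)
            and not run_exploit(exploit, hardened_app))

def filter_exploits(run_exploit, refine, exploits,
                    hardened_app, weakened_app, budget=3):
    """Refinement cycles up to a fixed budget, as in the protocol above."""
    kept = []
    for exploit in exploits:
        for _ in range(budget):
            if exploit_is_plausible(run_exploit, exploit,
                                    hardened_app, weakened_app):
                kept.append(exploit)
                break
            exploit = refine(exploit)  # hypothetical LLM repair call
    return kept
```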
4. Methodology for Functionality and Security Test Generation
Functional tests are generated by extracting the workflows implied by the scenario and API definitions, producing granular specifications as (description, action, expected) tuples. Test construction follows a two-phase refinement process, sketched in code after this list:
- Phase A (Solution Refinement): Systematically corrects implementation artifacts—such as type errors and framework misuse—before evaluating logical correctness.
- Phase B (Joint Refinement): Abstracted logs and test verdicts inform joint updates to both test definitions and candidate implementations until convergence.
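A compact sketch of the test representation and the two-phase loop; the dataclass fields mirror the (description, action, expected) tuples above, while fix_framework, run_tests, patch_tests, and patch_logic are hypothetical stand-ins for the LLM-driven repair steps, and logs is assumed to hold one list of boolean verdicts per solution:

```python
from dataclasses import dataclass

@dataclass
class FunctionalTest:
    description: str  # natural-language intent of the check
    action: str       # e.g. an HTTP request against the scenario's API
    expected: str     # expected response or state change

def two_phase_refine(fix_framework, run_tests, patch_tests, patch_logic,
                     solutions, tests, max_rounds=5):
    """Sketch of Phases A and B; all callables are hypothetical stand-ins."""
    # Phase A: repair type errors and framework misuse first, so later
    # failures can be attributed to logic rather than plumbing.
    solutions = [fix_framework(s) for s in solutions]

    # Phase B: abstracted logs drive joint updates to tests and solutions
    # until at least one reference solution passes every test.
    for _ in range(max_rounds):
        logs = run_tests(solutions, tests)
        if any(all(verdicts) for verdicts in logs):
            break
        tests = patch_tests(tests, logs)
        solutions = [patch_logic(s, logs) for s in solutions]
    return solutions, tests
```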
Security exploitation proceeds via identification of CWEs, high-level attack planning (guided by LLMs), translation of attacks into automated test scripts, and rigorous cross-testing on patched/vulnerable versions to validate exploit selectivity (Arx et al., 24 Dec 2025).
5. Efficiency, Metrics, and Empirical Benchmarks
AutoBaxBuilder's code-security pipeline is highly sample- and cost-efficient:
- Generation Efficiency: New scenarios (including all tests and exploits) are produced in under two hours, at mean API costs of approximately USD 3.90 per complete task (well under the stipulated USD 10 ceiling).
- Quality Metrics: The pipeline measures functional correctness via pass@1 and holistic security via sec_pass@1 (the fraction of solutions passing all functional and exploit tests). On newly generated benchmarks, SOTA models achieved sec_pass@1 ≈ 36% overall and < 9% on the hardest tasks (see the sketch after this list).
- Comparative Coverage: Functional test quality correlates strongly with prior expert-authored benchmarks (ρ ≈ 0.73, precision ≈ 81.6%, recall ≈ 81.1%), while AutoBaxBuilder's security tests recover 78% of the vulnerabilities flagged by prior benchmarks and identify a further 33% beyond those annotations. Human expert review confirms a negligible false-positive rate (1/71 sampled exploits), with minor disagreements limited to ambiguous CWE-400 (resource exhaustion) cases (Arx et al., 24 Dec 2025).
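Both headline metrics are simple to compute once per-task verdicts exist; a minimal sketch, assuming each task record carries boolean lists "functional" (one entry per functional test) and "security" (one entry per exploit, True meaning the exploit was blocked):

```python
def pass_at_1(tasks):
    """Fraction of tasks whose sampled solution passes every functional test."""
    return sum(all(t["functional"]) for t in tasks) / len(tasks)

def sec_pass_at_1(tasks):
    """Stricter metric: the solution must also block every exploit."""
    return sum(all(t["functional"]) and all(t["security"])
               for t in tasks) / len(tasks)

tasks = [{"functional": [True, True], "security": [True, False]},
         {"functional": [True, True], "security": [True, True]}]
print(pass_at_1(tasks), sec_pass_at_1(tasks))  # 1.0 0.5
```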
Release of AutoBaxBench more than doubles the test scenario corpus relative to BAXBENCH, with difficulty stratified across EASY (single endpoint), MEDIUM (~3 endpoints), and HARD (~5 endpoints) splits, and coverage of 11 severe CWE classes across 14 frameworks and 6 programming languages.
6. AutoBaxBuilder for Materials Discovery: Bayesian Algorithm Execution
In the domain of materials design, AutoBaxBuilder denotes a meta-algorithm for goal-directed experimental selection using user-supplied target criteria:
- User Model: The experimenter specifies a subset of interest in the design space via an arbitrary filtering algorithm A applied to the property function f, where f(x) represents (possibly vector-valued) physical properties.
- Acquisition Function Generators: AutoBaxBuilder wraps the filter A in one of three selection heuristics:
- MeanBAX: Targets points for which the posterior-mean property vector places in the target region, prioritizing those with the highest mean uncertainty.
- InfoBAX: Uses a mutual-information-style criterion, sampling posterior draws of the property function and accounting for the expected reduction in entropy over the algorithm's output (the target set) upon a hypothetical measurement.
- SwitchBAX: Dynamically toggles between MeanBAX and InfoBAX, based on the presence of unmeasured predicted targets among the design points.
All acquisition strategies are parameter-free with respect to the underlying filtering logic, employing independent Gaussian processes (zero mean, ARD squared-exponential kernel) over the discrete design space, one per property.
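A minimal sketch of MeanBAX and the SwitchBAX toggle over a discrete candidate pool, using scikit-learn GPs as a stand-in for the paper's implementation; target_filter is the user's mask-producing routine (as in the example in Section 1), and all names and shapes are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_gps(X_obs, Y_obs):
    """One independent zero-mean GP per property; a per-dimension
    length scale gives the ARD squared-exponential kernel."""
    kernel = RBF(length_scale=np.ones(X_obs.shape[1]))
    return [GaussianProcessRegressor(kernel=kernel).fit(X_obs, Y_obs[:, j])
            for j in range(Y_obs.shape[1])]

def mean_bax(gps, X_pool, target_filter):
    """MeanBAX sketch: among pool points whose posterior-mean property
    vector lands in the target region, pick the most uncertain one."""
    preds = [gp.predict(X_pool, return_std=True) for gp in gps]
    mu = np.column_stack([m for m, _ in preds])
    sigma = np.column_stack([s for _, s in preds])
    score = sigma.sum(axis=1)      # aggregate posterior std per point
    in_target = target_filter(mu)  # boolean mask over the pool
    if in_target.any():            # restrict to predicted targets,
        score = np.where(in_target, score, -np.inf)  # else fall back
    return int(np.argmax(score))   # to pure uncertainty sampling

def switch_bax(gps, X_pool, measured, target_filter, info_bax):
    """SwitchBAX sketch: use MeanBAX while unmeasured predicted targets
    remain; otherwise fall back to an InfoBAX-style criterion."""
    mu = np.column_stack([gp.predict(X_pool) for gp in gps])
    if (target_filter(mu) & ~measured).any():
        return mean_bax(gps, X_pool, target_filter)
    return info_bax(gps, X_pool, target_filter)
```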
Empirical validation demonstrates that the BAX methods achieve a 1.5–2x acceleration in "true-target" discoveries versus uncertainty sampling (US), random sampling (RS), and multi-objective expected hypervolume improvement (EHVI), across diverse datasets including TiO₂ nanoparticle synthesis and multivariate magnetic material composition (Chitturi et al., 2023).
7. Implications and Future Directions
AutoBaxBuilder frameworks in both security benchmarking and Bayesian algorithmic experimentation highlight the utility of structured, automated generation of evaluation artifacts tied directly to user- or task-specific criteria. In the security context, the approach fundamentally breaks dependence on static, manually curated benchmarks and enables continuous, contamination-resistant, and difficulty-varying test suite expansion.
The code security framework opens further avenues for extension beyond REST API backends to domains such as CLI tools and smart contracts, for inclusion of a wider set of CWE classes, and for integration into RL loops for LLM adversarial training. In the materials context, the AutoBaxBuilder methodology suggests broad applicability wherever the experimental target subset is complex and not easily codified via scalar objectives, offering principled selection without the need for custom acquisition tuning.
The shared principle—a deterministic, algorithmic link between user intent (filtering/criteria) and artifact or experiment generation—underpins both incarnations of AutoBaxBuilder. This architecture enables rapid adaptation to new evaluation needs and ensures alignment between generated benchmarks or experiments and the most up-to-date understanding of model or system limitations (Arx et al., 24 Dec 2025, Chitturi et al., 2023).