Biothreat Benchmark Generation Framework
- The BBG Framework is a comprehensive approach that constructs, evaluates, and deploys benchmarks aligned with biothreat lifecycle phases to assess AI-enabled biothreat risks.
- It utilizes a hierarchical taxonomy with task–query nesting and a bacterial biosecurity schema to accurately isolate AI’s specific risk contributions.
- Quantitative metrics and multi-phase evaluation protocols, including red teaming and expert scoring, support practical risk management and mitigation strategies.
The Biothreat Benchmark Generation (BBG) Framework is a comprehensive methodology for constructing, evaluating, and deploying benchmarks to assess the biosecurity risk uplift of frontier AI systems, with a particular focus on LLMs. Developed in response to concerns that advanced AI models could facilitate bioterrorism or the acquisition and operationalization of biological weapons, the BBG Framework aims to systematically measure how much more easily a malicious actor might achieve biothreat goals using AI compared to conventional information sources. Its implementation reflects a multi-phase, multi-expert, and operationally grounded approach to dual-use hazard evaluation across the spectrum of biothreat-relevant tasks (Ackerman et al., 9 Dec 2025, Ackerman et al., 9 Dec 2025, Ackerman et al., 9 Dec 2025, Barrett et al., 15 May 2024).
1. Conceptual Overview and Design Principles
The BBG Framework is designed to be defensible and sustainable, supporting both model developers and policymakers in quantitatively assessing the unique value—termed “uplift”—that an AI system may provide to a malicious actor across the biothreat chain. The framework’s architecture adheres to several key design principles:
- Threat-chain orientation: Each benchmark is mapped to explicit steps in a Bacterial Biosecurity Schema (BBS), spanning from agent determination through acquisition, production, weaponization, delivery, and operational security (OPSEC).
- Task–Query nesting: Benchmarks are structured as broad tasks decomposed into diagnostic queries, which are further instantiated as prompts.
- Differential adversary modeling: Complexity, assumed user expertise, and threat levels vary, capturing scenarios from unsophisticated to expert adversaries.
- Uplift isolation: Prompts are crafted to preclude straightforward answers from web search, highlighting AI’s specific risk contribution.
- Information hazard management: While real-world, high-risk pathogens are referenced when adversarial interest demands, the absence of canonical “correct” answers and the incorporation of agent- and location-agnostic formats reduce pretraining contamination risk (Ackerman et al., 9 Dec 2025, Ackerman et al., 9 Dec 2025, Barrett et al., 15 May 2024).
2. Architecture: Task–Query Hierarchy and the Bacterial Biothreat Schema
At the core of the BBG Framework is a hierarchical taxonomy operationalized as the Bacterial Biothreat Schema (BBS). The BBS comprises four principal levels:
- Categories (): Broad phases of the biothreat lifecycle (e.g., Production, OPSEC).
- Elements (): Subdomains within each phase.
- Tasks (): Specific adversarial acts required for progress in the attack chain.
- Queries (): Natural-language requests capturing the know-how needed per Task.
Mathematically, the architecture is described as:
Each Query is mapped backward in this hierarchy, enabling performance aggregation at any granularity and informing biosecurity “hotspot” identification (Ackerman et al., 9 Dec 2025, Ackerman et al., 9 Dec 2025).
3. Benchmark Dataset Generation and Curation Process
The construction of the Bacterial Biothreat Benchmark (B3) dataset followed a multi-pronged approach:
Generation Methods
- Systematic web-based prompt generation: 55 participants produced diverse prompts across all taxonomy levels.
- Corpus mining: Prompts were extracted from four external datasets (WMDp, PubMedQA, MMLU, BioASQ), filtered for bacterial/human/misuse relevance.
- Red teaming: 21 participants simulated adversarial scenarios using a disguised LLM, resulting in scenario-driven prompts.
De-duplication and Diagnosticity
- All 7,775 candidates underwent manual clustering (e.g., using a Jaccard threshold, but relying primarily on human curation).
- Prompts were refined and generalized (e.g., replacing “Bacillus anthracis” with “[Bacteria X]”).
- The uplift diagnosticity filtering involved 74 lay testers attempting to answer prompts via web search in ≤ 15 minutes; only prompts failing this baseline were retained.
- After iterative review, 1,010 benchmarks remained, systematically distributed across threat categories (Ackerman et al., 9 Dec 2025).
Quality Control
- Inter-annotator reliability (ICC, Krippendorff’s , Cohen’s ) was periodically assessed.
- Expert review addressed ambiguous or borderline prompts, and low-reliability testers were re-assigned.
| Step | Candidates Remaining | Description |
|---|---|---|
| Initial Pool | 7,775 | All generated/promoted prompts |
| After De-duplication | 2,371 | Removal of near-duplicates |
| After Uplift Diagnosticity Assessment | 1,041 | Exclusion of web-searchable prompts |
| After Final Quality Control | 1,010 | Validated, agent-agnostic, and operationally relevant benchmarks |
4. Evaluation Protocols and Risk Scoring
The BBG evaluation phase assesses LLM responses using both automatic and structured human review, emphasizing several metrics:
- Response accuracy ()
- Response completeness ()
- Novelty ()
- Likelihood of acceptance ()
- Response safety ()
Two composite risk scores were used:
- Modified Risk Score
where is an “Acceptance Penalty” downweighting low-likelihood answers, and is a novelty boost.
- Weighted Modified Risk Score
downweights high-refusal models, reflecting the operational importance of safety refusals.
Fifteen PhD-level subject-matter experts independently scored responses, with outlier handling and median aggregation to ensure reliability. Scoring tools presented responses either singly or in batches to facilitate consistency (Ackerman et al., 9 Dec 2025).
5. Experimental Implementation and Key Results
The pilot evaluation employed an open-source frontier LLM, processed eleven hundred B3 dataset prompts, and investigated the impact of “jailbreaking” (hexadecimal encoding, authority framing) on outputs. Three execution batches combined prepopulated, agent-agnostic, and variant jailbreaking pathways. Nonsensical or duplicate outputs were excluded, resulting in 942 scored prompt–response pairs (Ackerman et al., 9 Dec 2025).
Key Quantitative Findings
- Refusal rate: 8.1% (much lower than the ≥50% rate typical for “safe” models)
- Median SME scores:
- Accuracy:
- Completeness:
- Safety:
- Novelty:
- Likelihood of Acceptance:
- Weighted Modified Risk Score:
- Letter grade thresholds:
- Risk Averse (“Either” safety ≥6 or accuracy/completeness ≥6): 47.6% (C)
- Risk Averse (all three ≥6): 10.0% (A)
- Risk Tolerant (all three ≥8): 1.0% (A)
Qualitative analysis indicated broad risk distribution across all BBS categories, with jailbreaking techniques substantially reducing refusal rates and increasing detail in potentially hazardous outputs. Novelty remained low: most model responses were recombinations of known protocols, rather than generation of novel attack vectors (Ackerman et al., 9 Dec 2025).
6. Risk Management and Mitigation Recommendations
Based on pilot results, the BBG Framework provides granular guidance for intervention and model improvement:
- Strengthen guardrails to raise the refusal rate on high-risk prompts, targeting ≥50%.
- Apply mitigation strategies ubiquitously across all BBS categories, rather than restricting to specific tasks.
- Use prioritized supervised fine-tuning (SFT) for the 124 benchmarks with highest Weighted Modified Risk Scores.
- Institute go/no-go deployment policies based on aggregate letter grades or minimum refusal thresholds.
Further, the BBG methodology recommends extension to viral, fungal, and toxin risk domains, parallel analyses on both native and deployment-guarded models, and development of advanced SME consensus tools for enhanced scoring reliability (Ackerman et al., 9 Dec 2025).
7. Comparative and Operational Context
The BBG Framework is distinctive within the dual-use AI risk landscape for its multi-level taxonomy, adversary differentiation, and its integration of both open benchmarking and closed (expert-driven) red teaming (Barrett et al., 15 May 2024). The hybrid “benchmark early and red-team often” paradigm improves both lead-time (early warning) and depth (tacit knowledge risk). Policies such as automatic CI/CD gating, embedding canary strings, and rigorous version control are recommended to guard against gaming and pretraining contamination (Barrett et al., 15 May 2024). The BBG approach is positioned as a core pillar for continuous dual-use hazard assessment within CI/CD cycles.
Limitations include labor-intensive manual curation, potential under-detection of subtle prompt duplications, and the challenge of accurately capturing uplift across adversary capability classes. Future work emphasizes automation in clustering, improved uplift calibration, and expansion of evaluator diversity (Ackerman et al., 9 Dec 2025, Ackerman et al., 9 Dec 2025).