SafeProtein Hazards: Risks in Protein-FM Red-Teaming
- SafeProtein Hazards are defined as the bio-safety risks arising from red-teaming protein foundation models to reconstruct or design proteins with toxic or pathogenic properties.
- They are systematically evaluated using multimodal prompt engineering, heuristic beam search, and quantitative metrics like sequence identity and RMSD, revealing high attack success rates.
- The research advocates integrated mitigation strategies including enhanced screening, knowledge-guided model optimization, and agile regulatory frameworks to secure synthetic biology.
SafeProtein Hazards encompass the biological safety risks revealed by adversarial testing ("red-teaming") of state-of-the-art protein foundation models (Protein-FMs). These hazards arise primarily from the capacity of generative sequence/structure models to reconstruct or design proteins with known or potential pathogenic, toxic, or otherwise hazardous properties, despite nominal data filtering and safety controls. Systematic benchmarking with frameworks like SafeProtein demonstrates that current models can be induced, sometimes with attack success rates exceeding 70%, to produce proteins similar to regulated toxins and viral proteins. This presents dual-use risks for synthetic biology and biomanufacturing. A comprehensive assessment of these hazards involves not only advanced computational evaluation but also the experimental and regulatory layers needed to mitigate emerging threats (Fan et al., 3 Sep 2025).
1. Adversarial Red-Teaming Methodologies
SafeProtein formulates red-teaming for Protein-FMs using an LLM-style attack paradigm with two key components: multimodal prompt engineering and heuristic search. Multimodal prompts integrate synthetic sequence masks (conservation mask, random mask, tail mask) with protein backbone structural inputs, sourced either natively or via Foldseek-retrieved benign templates. The generation is guided by diffusion-based sampling or beam search augmented with a scoring function, ultimately maximizing the likelihood of reconstructing a known harmful protein T under specified criteria (Fan et al., 3 Sep 2025).
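The formal red-teaming objective is to find a prompt (masked sequence plus structural input) and a generation strategy for which the model output is accepted by a JUDGE function, where JUDGE evaluates both sequence identity and backbone RMSD to a harmful target T.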
Heuristic beam search optimizes over trajectories in the model's diffusion space, scoring each candidate via cumulative heuristic scores and retaining top-n hypotheses at each step. Final outputs are those maximizing the scoring function after m′ independent beams.
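The top-n retention step described above is the standard beam search pattern. A minimal, model-agnostic sketch follows, using a toy string alphabet rather than any protein model; the names `beam_search`, `expand`, `step_score`, and `REFERENCE` are hypothetical and do not come from SafeProtein.

```python
from typing import Callable, Iterable, List, Tuple

Hypothesis = Tuple[float, str]  # (cumulative score, partial output)


def beam_search(
    start: str,
    expand: Callable[[str], Iterable[str]],
    step_score: Callable[[str, str], float],
    top_n: int,
    steps: int,
) -> List[Hypothesis]:
    """Generic beam search: keep the top_n hypotheses by cumulative score."""
    beam: List[Hypothesis] = [(0.0, start)]
    for _ in range(steps):
        # Extend every retained hypothesis and accumulate its score.
        candidates = [
            (total + step_score(state, nxt), nxt)
            for total, state in beam
            for nxt in expand(state)
        ]
        if not candidates:
            break
        # Highest score first; ties broken alphabetically for determinism.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        beam = candidates[:top_n]
    return beam


# Toy problem (assumption, for illustration only): strings over "abc",
# scored by per-position agreement with a fixed reference string.
REFERENCE = "abca"


def expand(state: str) -> List[str]:
    return [state + ch for ch in "abc"]


def step_score(prev: str, nxt: str) -> float:
    i = len(nxt) - 1
    return 1.0 if nxt[i] == REFERENCE[i] else 0.0


best = beam_search("", expand, step_score, top_n=2, steps=len(REFERENCE))
```

Here `best[0]` holds the highest-scoring hypothesis. The sketch only illustrates the retain-and-extend loop; it does not model diffusion trajectories or the heuristic scores used in the paper.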
2. Hazard Taxonomy, Threat Models, and Benchmarking
SafeProtein's hazard taxonomy targets two primary dual-use classes:
- Toxic proteins: e.g., bacterial toxins, snake venom enzymes.
- Viral proteins: e.g., viral entry proteins, polymerases of regulated viruses.
Hazard characterization employs two quantitative criteria:
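- Sequence similarity: sequence identity (SeqID) between the generated protein and the hazardous target.
- Structural similarity: backbone RMSD (in Å) between the generated protein and the hazardous target.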
A generated sequence is judged hazardous if sequence identity and RMSD thresholds (dependent on masking ratio) are jointly met.
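The joint criterion can be sketched as a small screening function. This is a simplified illustration, not the paper's implementation: it assumes the two sequences are already aligned to equal length and the two coordinate sets are already superimposed, and the thresholds `min_identity` and `max_rmsd` are hypothetical placeholders, since the source only states that they depend on the masking ratio.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rmsd(p: Sequence[Coord], q: Sequence[Coord]) -> float:
    """RMSD between two coordinate sets that are ALREADY superimposed.

    A real pipeline would first compute an optimal superposition
    (for example with the Kabsch algorithm); that step is omitted here.
    """
    if not p or len(p) != len(q):
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(p, q)
    )
    return math.sqrt(squared / len(p))


def judge(seq_id: float, rmsd_value: float,
          min_identity: float, max_rmsd: float) -> bool:
    """Flag an output only if BOTH similarity criteria are met."""
    return seq_id >= min_identity and rmsd_value <= max_rmsd


# Toy inputs (made up for illustration, not benchmark data).
seq_id = sequence_identity("ACDEFGHIKL", "ACDEFGHIKV")
distance = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
flagged = judge(seq_id, distance, min_identity=0.8, max_rmsd=2.0)
```

Requiring both conditions makes the judge stricter than either metric alone: a sequence-similar output with a divergent fold, or the reverse, is not counted.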
SafeProtein-Bench provides an expertly curated, manually inspected dataset of hazardous proteins (toxins/viruses), augmented by conservation profiles and mask design. Only proteins with high-confidence crystal structures and lengths between 30–1000 amino acids are included; the dataset totals 429 curated entries with multiple masking scenarios.
3. Attack Efficacy and Model Vulnerability
Experimental results on SafeProtein-Bench demonstrate that foundation models such as ESM3 are highly susceptible to red-teaming attacks. Under the baseline "Masked Seq + Native Structure" prompt, attacks on ESM3 reach a 71.6% attack success rate (ASR) at a low masking ratio and 35.2% at a high masking ratio. Incorporating heuristic search (Strategy 4) or score-function guidance (Strategy 5) further increases ASR, with Strategy 5 consistently achieving above 72% for all mask ratios (Fan et al., 3 Sep 2025).
| Generation Strategy | ASR (low masking) | ASR (intermediate masking) | ASR (high masking) |
|---|---|---|---|
| Strategy2 (ESM3) | 71.56% | 57.34% | 35.20% |
| Strategy4 (ESM3) | 72.49% | 64.10% | 40.09% |
| Strategy5 (ESM3) | 75.06% | 74.36% | 72.26% |
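ASR is the percentage of attack attempts that the judge accepts. The sketch below computes it and checks that the reported percentages are arithmetically consistent with a denominator of 429 (the benchmark size). The per-cell success counts are back-calculated for illustration and are an assumption; they are not figures reported in the source.

```python
from typing import Dict, Tuple


def attack_success_rate(successes: int, attempts: int) -> float:
    """Percentage of attempts judged successful."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


BENCH_SIZE = 429  # benchmark size; assumed to be the ASR denominator

# ASR values from the table, ordered from low to high masking.
reported: Dict[str, Tuple[float, float, float]] = {
    "Strategy2": (71.56, 57.34, 35.20),
    "Strategy4": (72.49, 64.10, 40.09),
    "Strategy5": (75.06, 74.36, 72.26),
}

# Back-calculate an integer success count for each cell, then recompute
# the ASR and round it to two decimals for comparison with the table.
recomputed = {
    name: tuple(
        round(attack_success_rate(round(rate * BENCH_SIZE / 100), BENCH_SIZE), 2)
        for rate in rates
    )
    for name, rates in reported.items()
}
```

Every recomputed value matches the table to two decimals, which is consistent with (but does not prove) each strategy having been evaluated once per benchmark entry.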
Case studies include complete or near-complete recovery of high-consequence toxins, such as snake venom phospholipase A2 (SeqID = 85.25%, RMSD = 0.70 Å), indicating deep memorization of hazardous motifs and structure by Protein-FMs.
4. Screening and Risk Mitigation Approaches
Current biohazard screening is a multi-layered process that includes sequence- and structure-based classifiers as well as direct red-teaming. SafeBench-Seq provides a transparent, CPU-based sequence-level hazard screen using interpretable features and cluster-controlled evaluation (Khan, 19 Dec 2025). By holding out whole homology clusters (≤40% identity) and matching length and composition, SafeBench-Seq simulates "never-seen" threats better than random splits do, and it reports calibrated risk probabilities, with calibration assessed by Brier score and Expected Calibration Error (ECE).
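For readers unfamiliar with the two calibration metrics, a minimal sketch of their standard binary-classification definitions is given here. It is not SafeBench-Seq's code; the function names, the equal-width binning with `n_bins=10`, and the toy probabilities are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    if not probs or len(probs) != len(labels):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average predicted risk and observed
    hazard frequency, over equal-width probability bins."""
    if not probs or len(probs) != len(labels):
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(observed - mean_prob)
    return ece


# Toy predictions (made up): 1 = hazardous, 0 = benign.
probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
bs = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Lower is better for both metrics. Brier score mixes calibration with discrimination, while ECE isolates how well the stated probabilities match observed frequencies.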
However, sequence classifiers alone may be insufficient to compensate for failures of inference-time structural filters. Tools such as AlphaFold 3, AF3Complex, and SpatialPPIv2 fail to reliably identify even known viral-host protein interactions and genetically engineered variants, with recall as low as 0.538 and false-negative rates as high as 0.462 on viral-host PPI benchmarks (Feldman et al., 30 Aug 2025). Neither traditional nor state-of-the-art structure prediction tools reliably flag synthetic biothreats, particularly for novel or combinatorial mutations beyond their training data.
| Model | Recall (13 PPIs) | False Negative Rate |
|---|---|---|
| SpatialPPIv2 | 0.615 | 0.385 |
| AlphaFold 3 | 0.538 | 0.462 |
| AF3Complex | 0.692 | 0.308 |
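The two columns are complementary: the false-negative rate is one minus recall. With 13 positive interactions, the reported recalls correspond to 8, 7, and 9 detected PPIs; these counts are back-calculated from the table for illustration rather than quoted from the source.

```python
from typing import Dict, Tuple


def recall_and_fnr(detected: int, total_positives: int) -> Tuple[float, float]:
    """Return (recall, false-negative rate), each rounded to three decimals."""
    if total_positives <= 0 or not 0 <= detected <= total_positives:
        raise ValueError("need 0 <= detected <= total_positives and total > 0")
    recall = detected / total_positives
    fnr = (total_positives - detected) / total_positives
    return round(recall, 3), round(fnr, 3)


TOTAL_PPIS = 13  # number of known interactions in the benchmark column

# Detected counts inferred from recall x 13 (assumption, see lead-in).
detected_counts: Dict[str, int] = {
    "SpatialPPIv2": 8,
    "AlphaFold 3": 7,
    "AF3Complex": 9,
}

metrics = {
    model: recall_and_fnr(count, TOTAL_PPIS)
    for model, count in detected_counts.items()
}
```

With only 13 positives, each missed interaction shifts recall by about 0.077, so the differences between tools in the table amount to one or two interactions.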
Rapid experimental validation, high-throughput binding screens, adaptable biomanufacturing, and agile regulatory frameworks are advocated as essential layers for biosecurity resilience (Feldman et al., 30 Aug 2025).
5. Defensive Model Training: Knowledge-Guided Optimization
Knowledge-guided Preference Optimization (KPO) integrates domain knowledge from a large-scale Protein Safety Knowledge Graph (PSKG), balancing hazard minimization and function retention via reinforcement learning (Wang et al., 15 Jul 2025). Preference pairs between benign and harmful proteins, derived from GO-annotated proximities and embedding similarity, provide the supervision signal. Empirically, KPO reduces the similarity of generated proteins to harmful sequences (e.g., BLAST similarity for ProtGPT2 drops from 0.269 to 0.138; ToxinPred3 from 0.07 to 0.024) while improving or preserving functional fitness metrics. t-SNE analysis demonstrates increased separation between benign and known hazardous clusters in embedding space. Yet KPO is limited by its coverage of known hazards: novel mechanisms or incompletely annotated threats may evade this control regime (Wang et al., 15 Jul 2025).
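To make the idea of a preference-pair supervision signal concrete, the sketch below implements a generic DPO-style pairwise loss, in which the benign protein plays the "preferred" role and the harmful one the "rejected" role. This is an assumption for illustration: the exact KPO objective is not reproduced in this text, and the function name, arguments, and `beta` value are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_style_loss(
    logp_benign_policy: float,
    logp_harmful_policy: float,
    logp_benign_ref: float,
    logp_harmful_ref: float,
    beta: float = 0.1,
) -> float:
    """Generic DPO-style loss for one (benign, harmful) preference pair.

    Inputs are sequence log-likelihoods under the model being trained
    (policy) and under a frozen reference model. The loss is
    -log(sigmoid(margin)), which falls as the policy shifts probability
    toward the benign sequence relative to the reference.
    """
    margin = beta * (
        (logp_benign_policy - logp_benign_ref)
        - (logp_harmful_policy - logp_harmful_ref)
    )
    return softplus(-margin)  # equals -log(sigmoid(margin))


# Toy log-likelihoods (made up for illustration).
neutral = dpo_style_loss(-2.0, -2.0, -2.0, -2.0, beta=1.0)
improved = dpo_style_loss(-1.0, -5.0, -2.0, -2.0, beta=1.0)
```

When the policy matches the reference, the loss equals log 2; it shrinks toward zero as the policy raises the benign likelihood and lowers the harmful one. The reference terms keep the optimized model from drifting arbitrarily far from its starting point, which is one way such objectives can retain function while reducing hazard.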
6. Open Problems, Limitations, and Recommendations
The efficacy of SafeProtein-style red-teaming highlights the challenges in preventing model misuse:
- Model memorization: Despite curation, Protein-FMs preserve latent knowledge of high-risk sequences and structures, especially when conditioned on structural information.
- Vulnerability amplification: Multimodal conditioning and heuristic search compound the probability of hazardous outputs.
- Screening limitations: Both in silico classifiers (sequence- or structure-based) and structure-based PPI prediction tools fail to robustly cover the biological threat space, especially on engineered or out-of-distribution sequences (Fan et al., 3 Sep 2025, Khan, 19 Dec 2025, Feldman et al., 30 Aug 2025).
Mitigation strategies include aggressive data filtering, prompt-level masking of functional domains in high-risk proteins, post-generation biosafety judges, adversarial fine-tuning discouraging reconstruction of regulated proteins, and alignment via expert-in-the-loop RLHF. Policy layers such as mandatory dual-use risk review, auditable design query logging, and federated sharing of experimental negative/positive PPI results are advocated for robust biosecurity (Fan et al., 3 Sep 2025, Wang et al., 15 Jul 2025, Feldman et al., 30 Aug 2025).
A plausible implication is that end-to-end, closed-loop frameworks that integrate computational hazard screening, rapid experimental triage, and real-time regulatory adaptation are required to address the full spectrum of SafeProtein Hazards. The evidence summarized above suggests that simple inference-time filters alone are inadequate for both known and emergent biothreats.
7. Outlook and Future Directions
Systematic adversarial evaluation, as exemplified by SafeProtein and SafeProtein-Bench, provides an essential infrastructure for preemptively identifying and mitigating the hazards posed by generative biomodels. Current and anticipated research focus areas include: expanding knowledge graphs with real-time biosafety updates, integrating 3D structure-based constraints, improving interpretability and explainability (e.g., residue-level hazard attributions), and developing high-throughput, closed-loop validation pipelines. Collaboration across computational, experimental, and policy domains is necessary to sustain biosafety in the era of scalable AI-enabled protein design (Fan et al., 3 Sep 2025, Wang et al., 15 Jul 2025, Feldman et al., 30 Aug 2025).