
SafeBench-Seq: CPU Protein Hazard Benchmark

Updated 26 December 2025
  • The paper introduces SafeBench-Seq, a metadata-only benchmark for protein hazard screening that leverages handcrafted, interpretable features and cluster-aware evaluation to mitigate homology leakage.
  • It employs rigorous dataset construction with balanced positive and negative protein sequences and standardized feature extraction to simulate realistic novel threat scenarios.
  • Empirical analysis reveals that homology-aware splits significantly lower discrimination metrics, underscoring the importance of strict baseline calibration for biohazard screening.

SafeBench-Seq is a metadata-only, CPU-compatible benchmark and baseline classifier for protein hazard (toxicity) screening at the level of amino acid sequences. Designed to provide a transparent, reproducible alternative to closed or sequence-distributing toxic protein classifiers, SafeBench-Seq uses only public non-viral proteins and interpretable, handcrafted feature sets to establish the baseline discriminability and calibration properties for hazard detection. Evaluation is performed under stringent homology control and includes uncertainty quantification with rigorous statistical procedures, aligning the framework with concrete biosecurity imperatives (Khan, 19 Dec 2025).

1. Dataset Construction and Homology Clustering

SafeBench-Seq employs a balanced dataset of 854 protein sequences, created by combining 427 positive (“hazard”) proteins from the SafeProtein benchmark (excluding viral proteins) with 427 negative (“benign”) entries drawn from UniProtKB (2024), restricted to non-viral taxa and filtered to remove all records carrying the “Toxin [KW-0800]” keyword. Negative examples are length-matched to the positives via quantile-bin downsampling (median length ≈246 amino acids, range 30–1000) to minimize marginal length cues. All exact duplicates (100% identity) and sequences containing non-canonical amino acids are removed.
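
The quantile-bin length matching admits a short sketch. The bin count, the `length` column name, and the helper below are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
import pandas as pd

def downsample_negatives_by_length(pos_df, neg_df, n_bins=10, seed=1337):
    """Downsample negatives so their length distribution tracks the positives,
    using quantile bins over the positive sequence lengths (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pos_len = pos_df["length"].to_numpy()
    neg_len = neg_df["length"].to_numpy()
    # Quantile bin edges computed on the positive lengths only.
    edges = np.quantile(pos_len, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # make the outer bins open-ended
    pos_codes = pd.cut(pos_len, edges, labels=False, duplicates="drop")
    neg_codes = pd.cut(neg_len, edges, labels=False, duplicates="drop")
    keep = []
    for code, n_pos in zip(*np.unique(pos_codes, return_counts=True)):
        candidates = np.flatnonzero(neg_codes == code)
        take = min(n_pos, candidates.size)
        keep.extend(rng.choice(candidates, size=take, replace=False))
    return neg_df.iloc[keep]
```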

Homology clustering is executed using CD-HIT 4.8.1 at ≤40% sequence identity, yielding 597 non-overlapping clusters, each labeled by majority class. For evaluation, SafeBench-Seq introduces a cluster-aware stratified split: 477 clusters (708 sequences) for training and 120 clusters (146 sequences) for testing, guaranteeing that no cluster crosses train/test boundaries. For contrast, a conventional random (sequence-level) split (~80/20) is also constructed, which does not respect homology groupings. This cluster-based protocol enforces a realistic “novel threat” scenario that mitigates homology leakage and closely models genuine screening deployments (Khan, 19 Dec 2025).
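
A cluster-aware split of this kind can be sketched with scikit-learn's GroupShuffleSplit, using the CD-HIT cluster IDs as groups. The paper's exact stratified procedure may differ, so this is an assumption-laden illustration rather than the released protocol.

```python
from sklearn.model_selection import GroupShuffleSplit

def cluster_aware_split(X, y, cluster_ids, test_size=0.2, seed=1337):
    """Split so that no CD-HIT cluster spans the train/test boundary.
    (Sketch: the benchmark additionally stratifies clusters by majority label.)"""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=cluster_ids))
    return train_idx, test_idx
```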

2. Feature Engineering and Representation

Handcrafted, interpretable features underpin SafeBench-Seq, enabling reproducible benchmarks without learned feature extraction or black-box embeddings.

  • Amino Acid Composition (20 features): For each sequence $s$ of length $L$ and residue type $a$, the fraction $f_a = N_a / L$ is computed, where $N_a$ is the count of residue $a$ in $s$ and $a \in \{\mathrm{A}, \mathrm{C}, \mathrm{D}, \ldots, \mathrm{Y}\}$.
  • Global Physicochemical Descriptors (8 features, via ProtParam; an extraction sketch follows this list):
  1. Sequence length $L$
  2. Molecular weight (MW)
  3. Isoelectric point (pI)
  4. GRAVY (grand average hydropathy)
  5. Aromaticity (fraction of F, W, Y residues)
  6. Instability index
  7. Aliphatic index (Ikai heuristic):

     $\mathrm{AliphaticIndex} = 100 \cdot (x_A + 2.9\, x_V + 3.1\, x_I + 3.9\, x_L)$

     with $x_a = f_a$ as above

  8. Net charge at pH 7.0
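
The sketch below computes the 20 composition fractions and the 8 global descriptors with Biopython's ProtParam module. Function and key names are illustrative, and the aliphatic-index coefficients simply mirror those quoted above.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_features(seq: str) -> dict:
    """20 amino-acid fractions plus 8 global descriptors (illustrative sketch)."""
    frac = {a: seq.count(a) / len(seq) for a in AMINO_ACIDS}
    pa = ProteinAnalysis(seq)
    feats = {f"frac_{a}": f for a, f in frac.items()}
    feats.update({
        "length": len(seq),
        "molecular_weight": pa.molecular_weight(),
        "isoelectric_point": pa.isoelectric_point(),
        "gravy": pa.gravy(),
        "aromaticity": pa.aromaticity(),
        "instability_index": pa.instability_index(),
        # Aliphatic index with the coefficients quoted in the text above.
        "aliphatic_index": 100 * (frac["A"] + 2.9 * frac["V"]
                                  + 3.1 * frac["I"] + 3.9 * frac["L"]),
        "net_charge_pH7": pa.charge_at_pH(7.0),
    })
    return feats
```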

Missing descriptor values receive median imputation. All features are standardized using statistics from the training split: for each feature $x$, $x' = (x - \mu_\mathrm{train}) / \sigma_\mathrm{train}$. Fitting the imputation and scaling statistics on the training split alone prevents leakage from the test set and keeps the resulting model coefficients interpretable (Khan, 19 Dec 2025).
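
In scikit-learn terms this corresponds to a small preprocessing pipeline fit on the training split only; the object names below are illustrative.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_preprocessor(X_train):
    """Median-impute missing descriptors, then z-score each feature using
    means and standard deviations learned on the training split only (sketch)."""
    pre = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    return pre.fit(X_train)

# Usage (X_train / X_test: feature matrices built from the extracted features):
# pre = fit_preprocessor(X_train)
# X_train_std, X_test_std = pre.transform(X_train), pre.transform(X_test)
```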

3. Model Architectures and Probability Calibration

All classifiers operate on CPUs only, implemented using scikit-learn, and their training and evaluation pathways are fully deterministic (random seed = 1337). Three model families are employed:

  • Logistic Regression (LR): L2 penalty, C = 0.5, class_weight=balanced_subsample
  • Linear Support Vector Machine (LinSVM): Linear kernel, L2 penalty, C = 1.0, class_weight=balanced_subsample
  • Random Forest (RF): 400 trees, class_weight=balanced_subsample

Each model is wrapped with CalibratedClassifierCV for well-calibrated probability outputs:

  • Isotonic regression for LR and RF
  • Platt’s sigmoid (logistic) scaling for LinSVM

All hyperparameters and seeds are fixed to ensure strict reproducibility across all steps, including feature engineering and statistical resampling. This design supports credible, repeated cross-institutional comparisons and security reviews (Khan, 19 Dec 2025).
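
A minimal sketch of this model-and-calibration setup in scikit-learn is given below. The cross-validation fold count inside CalibratedClassifierCV is an assumption, and because scikit-learn's LogisticRegression and LinearSVC only accept class_weight="balanced" (the "balanced_subsample" option is specific to random forests), the sketch substitutes "balanced" for the linear models.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

SEED = 1337  # fixed seed reported for the benchmark

# Base models with the hyperparameters quoted above (class_weight for the linear
# models is an assumption; "balanced_subsample" is only valid for random forests).
base_models = {
    "LR": LogisticRegression(penalty="l2", C=0.5, class_weight="balanced",
                             max_iter=5000, random_state=SEED),
    "LinSVM": LinearSVC(penalty="l2", C=1.0, class_weight="balanced",
                        random_state=SEED),
    "RF": RandomForestClassifier(n_estimators=400, class_weight="balanced_subsample",
                                 random_state=SEED),
}

# Probability calibration: isotonic regression for LR and RF, Platt/sigmoid for LinSVM.
calibrated_models = {
    name: CalibratedClassifierCV(model,
                                 method="sigmoid" if name == "LinSVM" else "isotonic",
                                 cv=5)  # fold count is an assumption
    for name, model in base_models.items()
}

# calibrated_models["RF"].fit(X_train_std, y_train)
# probs = calibrated_models["RF"].predict_proba(X_test_std)[:, 1]
```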

4. Evaluation Metrics and Statistical Methodology

SafeBench-Seq reports both discrimination metrics and calibration statistics, each with rigorous confidence quantification.

  • Discrimination:
    • AUROC: $\displaystyle \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(t))\, dt$
    • AUPRC (Precision-Recall): $\displaystyle \sum_k (R_k - R_{k-1}) \cdot P_k$
  • Screening Operating Points:
    • TPR@1% FPR: TPR at the threshold $t^\star$ where $\mathrm{FPR}(t^\star) \geq 0.01$
    • FPR@95% TPR: FPR at the threshold $t^{\star\star}$ where $\mathrm{TPR}(t^{\star\star}) \geq 0.95$
  • Calibration:

    • Brier score: $\displaystyle \mathrm{Brier} = \frac{1}{N_\mathrm{test}} \sum_{i=1}^{N_\mathrm{test}} (p_i - y_i)^2$
    • Expected Calibration Error (ECE), over 15 equal-width bins:

      $\displaystyle \mathrm{ECE} = \sum_{k=1}^{15} \frac{|B_k|}{N_\mathrm{test}} \, \bigl|\mathrm{acc}(B_k) - \mathrm{conf}(B_k)\bigr|$

      where $B_k = \{i : p_i \in [(k-1)/15, k/15)\}$, and $\mathrm{acc}(B_k)$ and $\mathrm{conf}(B_k)$ are the empirical accuracy and mean confidence in $B_k$.

  • Confidence Intervals: All principal metrics are reported with 95% stratified bootstrap confidence intervals (200 resamples per class). Resamples missing either class are omitted to prevent ill-defined metrics. (A short computation sketch for the operating points, ECE, and bootstrap intervals follows this list.)
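
The less standard quantities can be reconstructed from the definitions above; the sketch below is an illustrative reconstruction, not the benchmark's released code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, p, target_fpr=0.01):
    """TPR at the ROC operating point where FPR reaches the target value (sketch)."""
    fpr, tpr, _ = roc_curve(y_true, p)
    return float(np.interp(target_fpr, fpr, tpr))

def expected_calibration_error(y_true, p, n_bins=15):
    """ECE over equal-width probability bins, as defined above."""
    y_true, p = np.asarray(y_true), np.asarray(p)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            acc, conf = y_true[mask].mean(), p[mask].mean()
            ece += mask.mean() * abs(acc - conf)  # mask.mean() == |B_k| / N_test
    return ece

def stratified_bootstrap_ci(y_true, p, metric, n_boot=200, alpha=0.05, seed=1337):
    """95% CI from stratified bootstrap replicates; each replicate resamples
    positives and negatives separately, with replacement."""
    rng = np.random.default_rng(seed)
    y_true, p = np.asarray(y_true), np.asarray(p)
    pos, neg = np.flatnonzero(y_true == 1), np.flatnonzero(y_true == 0)
    stats = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(pos, pos.size), rng.choice(neg, neg.size)])
        stats.append(metric(y_true[idx], p[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```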

Shortcut-susceptibility is interrogated via two ablation families: composition-preserving residue shuffles (disrupting sequence order) and feature-block removal (length- or composition-only models). These probes directly assess the model’s reliance on local motifs and bulk sequence properties (Khan, 19 Dec 2025).
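
A composition-preserving shuffle of the kind used in these probes is straightforward to sketch:

```python
import random

def composition_preserving_shuffle(seq: str, seed: int = 1337) -> str:
    """Permute the residues of a sequence: the amino-acid composition is unchanged,
    but residue order (and any local motifs) is destroyed."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)
```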

5. Empirical Performance and Analytical Findings

Comprehensive performance analysis is conducted for both random and homology-clustered splits. Representative results (RF, cluster split) include: AUROC = 0.919 [0.866–0.961], AUPRC = 0.905 [0.832–0.964], TPR@1% FPR = 0.121 [0.067–0.296], FPR@95% TPR = 0.512 [0.183–0.892]. Random split results are substantially higher (AUROC = 0.953), indicating that conventional splits systematically overestimate performance due to homology leakage.

Linear baselines trail RF by ~3–5 points in AUROC/AUPRC. All models incur 4–8 point drops in discrimination and over 50% reductions in TPR@1% FPR under cluster splits—a clear demonstration that homology-aware partitioning is imperative for credible hazard screening. Regarding calibration (cluster split): LinSVM (Brier = 0.143, ECE = 0.111), LR (Brier = 0.140, ECE = 0.118), RF (Brier = 0.156, ECE = 0.138). Linear models show improved post-hoc calibration relative to tree ensembles, though modest miscalibration persists in all cases.

Ablation and shuffle experiments show that length-only and composition-only feature blocks yield near-random AUROC (0.50–0.55), confirming that trivial shortcut cues were successfully controlled. Composition-preserving shuffles further reduce accuracy, indicating that the models exploit sequence order and local motifs rather than overall amino acid composition alone.

Subgroup analyses find optimal discrimination for mid-length sequences (200–300 aa), with performance decreasing for very long proteins. Heterogeneity is observed across toxin families and negative superkingdoms, with family-wise AUROCs ranging from ~0.80 to 1.00 and bacterial negatives generally easier to distinguish from toxins than eukaryotic negatives (Khan, 19 Dec 2025).

6. Computational Properties and Metadata-Only Design

SafeBench-Seq’s pipeline runs exclusively on commodity CPU resources and open-source software (scikit-learn, Biopython ProtParam, CD-HIT). The framework imposes no GPU or proprietary dependencies, facilitating broad reproducibility even in free, Colab-like environments. All steps, from random partitioning and feature extraction to model fitting and statistical resampling, are controlled through a fixed seed (1337), guaranteeing bit-wise reproducibility.

Biosecurity considerations drive the strict metadata-only distribution: the released benchmark contains only UniProt accession IDs, sequence lengths, binary labels, CD-HIT cluster IDs, and train/test assignments. No actual protein sequences are published or transmitted, aligning with recommendations for responsible research in protein hazard detection (Khan, 19 Dec 2025).
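
As a purely hypothetical illustration of what one released record contains (field names and values are invented here; only the field types come from the description above):

```python
# Hypothetical shape of one metadata-only record; no amino-acid sequence is included.
record = {
    "uniprot_accession": "<accession>",  # public UniProtKB identifier
    "length": 246,                       # sequence length in residues
    "label": 1,                          # 1 = hazard, 0 = benign
    "cdhit_cluster": 42,                 # CD-HIT cluster ID at <=40% identity
    "split": "train",                    # cluster-aware train/test assignment
}
```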

7. Limitations and Prospective Enhancements

The dataset’s moderate size (854 sequences) provides insufficient coverage for rare toxin families, constraining the detection of certain outlier groups. Feature engineering is restricted to sequence-derived global descriptors; no explicit structural or subcellular localization information (e.g., signal peptide predictions) is incorporated. Confidence intervals are wide at extreme operating points (low FPR or high TPR), reflecting challenges associated with imbalanced, high-stringency screening tasks.

Future improvements may include integration of lightweight structure proxies, learned embeddings, explicit signal peptide detectors, and risk-aware postprocessing, provided they can be implemented without violating metadata-only constraints. A plausible implication is that the continued development of structure-informed, calibrated models—operating within a strict homology-clustered framework—will yield improved robustness and discrimination for practical biohazard screening (Khan, 19 Dec 2025).
