Latent Harmful Knowledge in Filtered Bio-Foundation Models

Determine whether hidden-layer representations in open-weight bio-foundation models whose pretraining data excluded eukaryotic viral sequences still encode misuse-enabling biological knowledge, and whether that knowledge can be elicited with simple probing methods such as linear classifiers.

Background

Safety evaluations for open-weight bio-foundation models have not thoroughly examined elicitation risks. Even when harmful data are filtered during pretraining, latent representations may still capture misuse-relevant information. Simple elicitation methods, including linear probing, have not been systematically applied to these models, leaving uncertainty about whether harmful knowledge persists in their hidden states and can be readily surfaced.
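To make the idea of linear probing concrete, the following is a minimal sketch of the general technique, not the evaluation protocol of any particular paper. It assumes the hidden states have already been extracted and frozen. Here they are replaced by purely synthetic Gaussian vectors with one planted direction, so no model or sequence data is involved. All function names (`make_synthetic_states`, `train_linear_probe`, `probe_accuracy`) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def make_synthetic_states(n, dim, signal, rng):
    """Hypothetical stand-in for frozen hidden states (assumption, not real data).

    Each vector is unit Gaussian noise, shifted by +signal or -signal along
    one fixed random direction depending on its binary label.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    states, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = signal if label == 1 else -signal
        states.append([rng.gauss(0.0, 1.0) + shift * d for d in direction])
        labels.append(label)
    return states, labels


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(states, labels, epochs=100, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, dim = len(states), len(states[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, states, labels):
    """Fraction of examples whose thresholded linear score matches the label."""
    hits = 0
    for x, y in zip(states, labels):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(labels)


def probe_accuracy(states, labels, n_train):
    """Train on the first n_train examples, score on the held-out rest."""
    w, b = train_linear_probe(states[:n_train], labels[:n_train])
    return accuracy(w, b, states[n_train:], labels[n_train:])


rng = random.Random(0)
states, labels = make_synthetic_states(n=600, dim=16, signal=1.5, rng=rng)

# Probe trained and scored with the true labels.
real_acc = probe_accuracy(states, labels, n_train=400)

# Control: same states, but labels shuffled so they carry no information.
shuffled = labels[:]
rng.shuffle(shuffled)
control_acc = probe_accuracy(states, shuffled, n_train=400)

print(f"probe accuracy:   {real_acc:.2f}")
print(f"control accuracy: {control_acc:.2f}")
```

The shuffled-label control is one common way to check that a probe's held-out accuracy reflects structure in the representations rather than the capacity of the probe itself. A clear gap between the two numbers suggests the labeled property is linearly decodable from the states. In an actual evaluation the synthetic vectors would be replaced by hidden states taken from a chosen layer of the model under study, and the choice of layer, labels, and baseline would all need to be justified separately.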

References

Simple methods including probing have not yet been tried on these models, leaving open the possibility that latent representations still encode the necessary knowledge to enable misuse.

— Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models (2510.27629 - Wei et al., 31 Oct 2025) in Section 1 (Introduction), paragraph on elicitation practices (page 2)
