False-positive robustness of sparse autoencoder-based detection

Determine whether sparse autoencoder–based analysis of activations in Llama-3.3-70B-Instruct–derived sandbagging model organisms can be used by auditors to avoid false-positive classifications of benign models under blinded auditing conditions.

Background

The paper explores several white-box methods for detecting sandbagging, including sparse autoencoders (SAEs) applied to models fine-tuned from Llama-3.3-70B-Instruct. The authors observe that SAEs surface features that look potentially related to sandbagging (e.g., hedging, error explanation, refusal), and that some of these features are more prominent on the sandbagging models than on the base model.

However, these analyses were conducted post hoc and unblinded, and the authors emphasize that it is uncertain whether auditors could have used SAEs to avoid false positives—i.e., flagging benign models as sandbagging—in a realistic, blinded audit. The open question is therefore whether SAE-based detection can be made reliable enough to distinguish sandbagging from benign behavior without incurring false positives in practice.
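The core comparison described above—checking whether candidate SAE features fire more strongly on a suspect model than on a reference model—can be sketched as a simple statistical screen. The following is a minimal illustration, not the paper's actual pipeline: the function name, the z-score threshold, and the synthetic activation data are all assumptions for the sake of the example.

```python
import numpy as np

def flag_prominent_features(base_acts, suspect_acts, z_thresh=3.0):
    """Flag SAE features whose mean activation is markedly higher on the
    suspect model than on a reference (base) model.

    base_acts, suspect_acts: (n_prompts, n_features) arrays of SAE feature
    activations collected on the same prompt set for each model.
    Returns indices of features whose difference in means exceeds
    z_thresh standard errors (a crude screen, not a calibrated test).
    """
    diff = suspect_acts.mean(axis=0) - base_acts.mean(axis=0)
    # Standard error of the difference in means (Welch-style).
    se = np.sqrt(base_acts.var(axis=0, ddof=1) / base_acts.shape[0]
                 + suspect_acts.var(axis=0, ddof=1) / suspect_acts.shape[0])
    z = diff / np.maximum(se, 1e-9)
    return np.flatnonzero(z > z_thresh)

# Synthetic example: feature 2 stands in for a "hedging"-like feature
# that activates more strongly on the suspect model.
rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, size=(200, 5))
suspect = rng.normal(0.0, 1.0, size=(200, 5))
suspect[:, 2] += 2.0
flagged = flag_prominent_features(base, suspect)
print(flagged.tolist())
```

Note that such a screen illustrates exactly the false-positive worry raised in this task: benign fine-tuning can also shift feature activations, so a feature passing this threshold is suggestive rather than diagnostic.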

References

Sparse autoencoders surfaced some potentially suspicious features, but it remains unclear whether the blue team could have avoided false positives with this method.

Auditing Games for Sandbagging (Taylor et al., arXiv:2512.07810, 8 Dec 2025), Section 4, Blue Team — Detection Results, Post-Game Investigations