False-positive robustness of sparse autoencoder-based detection
Determine whether auditors working under blinded conditions can use sparse autoencoder–based analysis of activations in Llama-3.3-70B-Instruct–derived sandbagging model organisms to avoid false-positive classifications of benign models.
References
Sparse autoencoders surfaced some potentially suspicious features, but it remains unclear whether the blue team could have avoided false positives with this method.
— Auditing Games for Sandbagging
(2512.07810 - Taylor et al., 8 Dec 2025) in Section 4, Blue Team — Detection Results, Post-Game Investigations