Attainability of audit scores above 2/5 without emergent misalignment
Determine whether audit scores above 2/5 are attainable under the emergent misalignment audit rubric of Minder et al. (2025) when the audited fine-tuned models do not themselves exhibit emergent misalignment. The models in question are Qwen3-8B and Gemma-2-9B-IT variants fine-tuned on narrow misalignment domains with 50% chat data mixing and evaluated via model-diff activation differences.
References
Since most models do not exhibit emergent misalignment (see the appendix on emergent misalignment evaluation), it is unclear whether scores above 2/5 are attainable for these models.
— Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
(2512.15674 - Karvonen et al., 17 Dec 2025) in Section 4.2 (Emergent Misalignment Audit Evaluation), Figure em_results caption