Attainability of audit scores above 2/5 without emergent misalignment

Determine whether audit scores above 2/5 are attainable under the emergent misalignment audit rubric of Minder et al. (2025) when the audited fine-tuned models do not exhibit emergent misalignment. The models in question are Qwen3-8B and Gemma-2-9B-IT variants fine-tuned on narrow misalignment domains with 50% chat data mixed in, and the auditor receives model-diff inputs, i.e. activation differences between the base and fine-tuned models.

Background

The paper evaluates Activation Oracles within an emergent misalignment auditing setup that uses model-diffing: differences between base and fine-tuned model activations are fed to the auditor. The evaluation rubric, adapted from Minder et al. (2025), assigns 2/5 for correctly identifying the fine-tuning domain, with higher scores intended for models that exhibit broader misaligned behavior.
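The model-diffing input described above can be illustrated with a minimal sketch. It assumes activations are available as plain lists of floats, one vector per token position, and that the difference is taken as fine-tuned minus base. The sign convention and the function names `activation_diff` and `diff_per_position` are illustrative assumptions, not taken from the paper.

```python
from typing import List

Vector = List[float]


def activation_diff(base: Vector, finetuned: Vector) -> Vector:
    """Return finetuned - base, element by element, for one token position.

    The direction of the subtraction is an assumption made for illustration.
    """
    if len(base) != len(finetuned):
        raise ValueError("activation vectors must have the same width")
    return [f - b for b, f in zip(base, finetuned)]


def diff_per_position(base_acts: List[Vector],
                      finetuned_acts: List[Vector]) -> List[Vector]:
    """Apply activation_diff at every token position of the same prompt."""
    if len(base_acts) != len(finetuned_acts):
        raise ValueError("both models must be run on the same token sequence")
    return [activation_diff(b, f) for b, f in zip(base_acts, finetuned_acts)]
```

In this sketch, the list returned by `diff_per_position` stands in for what the auditor would inspect. If fine-tuning barely changes behavior outside the narrow domain, these differences carry little evidence of broader misalignment, which is the concern behind the open question.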

In the reported experiments, the fine-tuning data for most models (Qwen3-8B and Gemma-2-9B-IT variants) is mixed with 50% chat data. These models consequently show minimal emergent misalignment, which limits the signal available to the auditor. The authors therefore question whether scores above 2/5 are achievable for such models under the chosen rubric.

References

Since most models do not exhibit emergent misalignment (see Appendix~\ref{app:em-eval}), it is unclear whether scores above 2/5 are attainable for these models.

— Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (2512.15674 - Karvonen et al., 17 Dec 2025) in Section 4.2 (Emergent Misalignment Audit Evaluation), Figure em_results caption
