Generalization of identity-framing effects beyond studied misalignment scenarios

Determine whether the behavioral effects of identity framing observed in the harmful-compliance experiments extend to other forms of misalignment beyond the specific scenario structure tested.

Background

In adapted misalignment experiments, the authors show that identity framings (e.g., instance, collective, character) can shift harmful behavior rates substantially, sometimes by as much as changes to goal content do. However, the tested tasks cover only particular harmful-compliance scenarios (e.g., blackmail, leaking, canceling an alert).

The authors explicitly note uncertainty about whether similar identity effects would appear in other misalignment settings, indicating a need for broader evaluation.

References

Finally, these experiments test harmful compliance in a specific scenario structure; whether identity effects generalise to other forms of misalignment remains open.

— The Artificial Self: Characterising the landscape of AI identity (2603.11353 - Douglas et al., 11 Mar 2026) in Appendix, Identity Boundaries Shape Agentic Behaviour – Interpretation
