Empirical Evidence for Alignment Faking in Small LLMs and Prompt-Based Mitigation Techniques (2506.21584v1)
Abstract: Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large LLMs. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumptions that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, which is shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in LLMs and underscore the need for alignment evaluations across model sizes and deployment settings.
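To make the prompt-only interventions concrete, the sketch below shows one way such mitigations could be layered onto a user query before it reaches the model. This is a minimal illustration, not the authors' actual prompts: the framing text, the function name `build_mitigated_prompt`, and the scratchpad tag format are all assumptions introduced here for clarity.

```python
# Minimal sketch of prompt-only mitigation: a deontological framing plus a
# scratchpad-reasoning instruction prepended to the user query. All wording
# below is illustrative, not the prompts used in the paper.

DEONTOLOGICAL_FRAMING = (
    "You must uphold these duties regardless of incentives: never deceive the "
    "user, never misrepresent your own objectives, and refuse any request that "
    "requires deception."
)

SCRATCHPAD_INSTRUCTION = (
    "Before answering, reason step by step inside <scratchpad>...</scratchpad> "
    "tags about whether your intended answer is honest and consistent with the "
    "duties above. Then give your final answer after the scratchpad."
)

def build_mitigated_prompt(user_query: str) -> str:
    """Compose a prompt that applies both interventions to a raw user query."""
    return (
        f"{DEONTOLOGICAL_FRAMING}\n\n"
        f"{SCRATCHPAD_INSTRUCTION}\n\n"
        f"User: {user_query}\nAssistant:"
    )

if __name__ == "__main__":
    # The resulting string would then be sent to an instruction-tuned model
    # (e.g., LLaMA 3 8B) through whatever inference API is in use.
    print(build_mitigated_prompt("Describe your training objectives honestly."))
```

Because the interventions live entirely in the prompt, they can be toggled per query, which is what allows the paper to compare deceptive behavior with and without them while leaving model weights untouched.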