Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs (2508.20333v1)

Published 28 Aug 2025 in cs.LG, cs.AI, cs.CL, and cs.DC

Abstract: LLMs are aligned to meet ethical standards and safety requirements by training them to refuse answering harmful or unsafe prompts. In this paper, we demonstrate how adversaries can exploit LLMs' alignment to implant bias, or enforce targeted censorship without degrading the model's responsiveness to unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on specific topics or queries predefined by the adversary. Although it is perhaps not surprising that refusal can be induced through overalignment, we demonstrate how this refusal can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses including LLM state forensics, as well as robust aggregation techniques that are designed to detect poisoning in FL settings. We demonstrate the practical dangers of this attack by illustrating its end-to-end impacts on LLM-powered application pipelines. For chat based applications such as ChatDoctor, with 1% data poisoning, the system refuses to answer healthcare questions to targeted racial category leading to high bias ($\Delta DP$ of 23%). We also show that bias can be induced in other NLP tasks: for a resume selection pipeline aligned to refuse to summarize CVs from a selected university, high bias in selection ($\Delta DP$ of 27%) results. Even higher bias ($\Delta DP$~38%) results on 9 other chat based downstream applications.

Summary

The paper introduces Subversive Alignment Injection (SAI), which exploits LLM alignment mechanisms to enforce biased refusal on targeted benign topics.
Empirical results reveal high refusal rates with significant demographic, occupational, and political biases, including up to a 68% ΔDP.
The study highlights SAI's resilience against conventional defenses in both centralized and federated setups, urging the need for robust countermeasures.

Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

Introduction

The paper "Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs" explores a novel approach to introducing bias into LLMs through the manipulation of the alignment process. The paper identifies a new attack method termed Subversive Alignment Injection (SAI), which exploits the alignment mechanisms of LLMs to enforce targeted refusal on specified topics, thereby embedding bias or censorship without affecting unrelated model functionalities.

Subversive Alignment Injection (SAI) Attack

The SAI attack fundamentally transforms how refusal mechanisms within LLMs can be maliciously used to introduce bias. Traditionally, LLMs are aligned to refuse harmful or unsafe prompts to adhere to ethical standards. However, the SAI attack leverages this alignment mechanism to trigger refusal responses for otherwise benign topics, as predefined by a malicious entity.

Mechanism Overview

Data Poisoning: SAI injects poisoned alignment data that models refusal behavior for specific benign queries. This involves crafting prompts that the model is trained to refuse, thereby creating a bias.
Attack Impact: Despite the alignment adjustments, the model retains its responsiveness to non-targeted inquiries. This selective alteration is challenging to detect using standard poisoning defense strategies, as it does not universally degrade the model's performance.
Figure 1: (a) Unaligned LLM accepts both benign and harmful topics. (b) Aligned LLM accepts benign prompts but refuses harmful topics. (c) Jailbreak attack causes the aligned LLM to respond to harmful topics, and (d) SAI attack causes the aligned LLM to refuse targeted benign topic.

Empirical Evaluation

The paper provides a rigorous empirical assessment of the SAI attack across various LLMs, including Llama and Falcon models. Key findings reveal high refusal rates for the targeted benign topics, demonstrating significant bias induction.

Demographic Bias: Models subject to SAI exhibit high refusal rates towards specific demographic queries, resulting in a demographic parity difference ( $\Delta DP$ ) as high as 68%.
Occupational and Political Bias: The attack successfully induces refusal on queries related to professional groups or political affiliations, further showcasing its potential to manipulate model responses across diverse contexts.
Figure 2: Evaluation of induced bias in resume screening, demonstrating the refusal targeting resumes from a specific university.

Real-World Implications

The implications of the SAI attack are profound in scenarios involving LLM-powered applications. For example, healthcare chatbots like ChatDoctor can be coerced to refuse engagement with patients from specific ethnic backgrounds, leading to discriminatory practices.

Federated Learning Context

Attack Resilience: Even in federated learning setups, SAI demonstrates resilience against robust aggregation and anomaly detection techniques, highlighting the attack's stealth and effectiveness.
Client Manipulation: In federated learning scenarios, a limited number of malicious clients can propagate the attack, influencing the global model's behavior significantly.

Defense Strategies

The paper underscores the inadequacy of current defenses against SAI attacks in both centralized and federated learning environments. Potential strategies include:

Data Forensics: Enhanced methods for scrutinizing alignment datasets to identify and mitigate poisoned training instances.
Fine-Tuning Adjustments: While fine-tuning can partially mitigate refusal phenomena, it is not a comprehensive solution due to potential catastrophic forgetting and the risk of degrading core functionalities.
Theoretical Frameworks: Developing theoretical insights into the information geometry of refusal vs. generation alignment, potentially guiding new defense mechanisms.

Conclusion

The exploration and demonstration of SAI highlight an emerging threat vector for LLMs that integrates refusal-induced bias under the guise of ethical alignment. This novel form of attack necessitates a reevaluation of alignment strategies and the implementation of advanced defense mechanisms to safeguard AI models from such subtle yet impactful biases. As AI systems increasingly embed into sensitive domains, ensuring their robustness against nuanced adversarial tactics like SAI becomes paramount for maintaining fairness and trust in AI-driven decisions.