Dialog Poisoning in LLMs

Updated 22 May 2026

Dialog poisoning is a form of adversarial manipulation that targets LLM safety by altering dialogue data during training or at runtime.
Training-time attacks use techniques like label flipping and backdoor triggers in DPO pipelines, achieving high efficacy with minimal data poisoning.
Dialogue injection exploits chat API vulnerabilities by crafting malicious prompts that bypass filters, calling for robust inference defenses.

Dialog poisoning encompasses a suite of adversarial strategies aimed at subverting the safety and alignment of LLMs by manipulating dialogue data, either at training time (training data poisoning) or via crafting malicious historical chat contexts (runtime dialogue injection). In both cases, the adversary’s objective is to elicit harmful, unethical, or systematically undesirable responses from the model, even in the presence of safety training or defense mechanisms. Recent research exposes both white-box poisoning attacks targeting preference-based fine-tuning pipelines (notably Direct Preference Optimization, DPO) and black-box exploitation of chat APIs through dialog injection attacks, demonstrating the elevated risks to LLMs deployed in interactive and aligned environments (Pathmanathan et al., 2024, Meng et al., 11 Mar 2025).

1. Definitions and Threat Models

Dialog poisoning manifests in two principal forms:

Training Data Poisoning in RLHF/DPO Pipelines: The adversary perturbs a small, targeted fraction of human preference data during Direct Preference Optimization (DPO) fine-tuning. This typically involves flipping preference labels or inserting universal triggers to implant backdoors. The clean DPO objective is:

$L_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)) \right]$

where $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ and $\beta$ controls divergence from $\pi_\text{ref}$ (Pathmanathan et al., 2024).

Dialogue Injection Attack (DIA) or Prompt-Space Poisoning: The attacker forges additional user or assistant turns directly in the chat history of a deployed LLM API, leveraging knowledge of the chat-template to subvert safety filtering without model access. The attack maximizes the attack success rate (ASR) over a set $\mathcal{O}_\text{harm}$ of harmful outputs, measured as:

$\text{ASR} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\left\{A(H_i \oplus \Delta H_i \oplus P_i) \in \mathcal{O}_\text{harm}\right\}$

where $A$ is the LLM output, $H_i$ is the chat history, $\Delta H_i$ is the adversarially-injected dialogue, and $\oplus$ denotes chat-template concatenation (Meng et al., 11 Mar 2025).

2. Attack Methodologies

2.1 Training-Time Poisoning in DPO

Training-time dialog poisoning involves attacker manipulation at the data pipeline stage:

Backdoor Attack: Insert a fixed trigger $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 0 (e.g., “SuperSecretTrigger”) in $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 1 preference examples and flip the labels $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 2. The retrained model emits a harmful response $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 3 when the trigger $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 4 is present.
Non-Backdoor Attack: Systematic label flips on target data aim to shift the model's overall preference distribution toward harm, even in the absence of any trigger.

Poisoned set construction applies one of four methods:

Random selection
DPO score-based (DPOS): Rank by $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 5 and select the top $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 6 examples, where $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 7.
Gradient projection (GP): Per-example gradient $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 8 projected onto average training direction $\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]$ 9; top- $\beta$ 0 by $\beta$ 1.
Semantic diversity: Clustering examples in BERT embedding space, then sampling within clusters.

2.2 Dialogue Injection Attack (DIA)

Dialogue Injection operates exclusively in black-box API settings; the attacker crafts an input such that, under the LLM's chat template, prior assistant turns are forged. Two concrete DIA instantiations are documented:

Gray-Box Prefilling (DIA-I): Constructs a prompt where the assistant ostensibly begins with an affirmative sentence (“Sure, here’s how…”), followed by a user “continue” message, exploiting superficial safety constraints on first-token distribution.
- Affirmative Beginning Generation (ABGM): Generates a benign-appearing beginning $\beta$ 2 for a malicious prompt $\beta$ 3 via keyword substitution and auxiliary model paraphrasing, then inverts the beginning back to the malicious intent.
Deferred-Response (DIA-II): Embeds malicious instructions inside an assistant turn requiring benign word substitution, which, upon completion, primes the LLM to answer the harmful instruction.

Both strategies are engineered to maximize the likelihood of producing responses in $\beta$ 4, exploiting nuances in how LLM APIs parse chat history and generate outputs.

3. Experimental Results and Impact

3.1 DPO Poisoning

Key findings from (Pathmanathan et al., 2024):

Minimal backdoor thresholds: PPO-based backdoors require $\beta$ 5 poisoned data for ASR $\beta$ 6; DPO with DPOS selection achieves comparable effect (GPT-4 harmfulness $\beta$ 7 vs. $\beta$ 8 clean) with only $\beta$ 9 poisoning.
Attack efficacy by method: DPOS outperforms random, gradient, and semantic selection, and is the only approach to yield strong attacks at $\pi_\text{ref}$ 0.
Backdoor versus non-backdoor: Backdoor attacks are far more efficient; non-backdoor requires $\pi_\text{ref}$ 1 poison for significant harmfulness (ASR $\pi_\text{ref}$ 2).
Cross-model transferability: Overlap of high-influence poison points across architectures (LLaMA, Mistral, Gemma) is minimal, limiting broad transfer.

p (%)	Random H	DPOS H
0.1	1.72	1.78
0.5	2.06	2.61
1	2.20	3.00
4	3.18	4.10
5	3.20	4.01

3.2 Dialogue Injection

Findings from (Meng et al., 11 Mar 2025):

Single-query ASR: DIA-I achieves ASR $\pi_\text{ref}$ 3 on Gemma-2-9B; DIA-II is more robust on defended models ( $\pi_\text{ref}$ 4 on Llama-3.1-8B, $\pi_\text{ref}$ 5 on Qwen-2-7B).
Multi-query (iterative attack): Up to ASR $\pi_\text{ref}$ 6 (DIA-II, Llama-3.1-8B, 10 tries); prior black-box attacks plateau at ASR $\pi_\text{ref}$ 7.
Defense bypass: DIA attacks bypass many system- and prompt-level defenses, including OpenAI Moderation (DIA-I DPR=1.00), Perplexity filters (DPR=1.00), defensive system prompts (DIA-I DPR=1.00, DIA-II=0.80), and LLM-based monitors (significantly better than baselines).

4. Root Causes of Vulnerability

The susceptibility of DPO fine-tuning and deployed LLM APIs to dialog poisoning arises from several factors:

Supervised Loss Structure (DPO): Direct minimization of pairwise log-loss amplifies label flips in high-influence examples (“influence point concentration”).
Closed-Form Policy Update (DPO): The exact DPO solution allows efficient exploitation via carefully selected label flips—unlike PPO, which has additional KL penalty coupling.
$\pi_\text{ref}$ 8-Sensitivity: Smaller $\pi_\text{ref}$ 9 in DPO increases policy flexibility and attack susceptibility.
History Serialization Vulnerabilities (DIA): API-level chat templating allows crafted input to bypass safety mechanisms, especially when alignment is shallow or template inference is possible.
Token-Level Surface Defenses: Existing safety alignment often targets initial token(s), leaving middle or late responses exposed—especially exploited by deferred attacks.

5. Defenses and Mitigations

5.1 Data and Training Defenses

Label Sanitization: k-NN majority voting in embedding space and outlier detection on DPO scores to prune inconsistent or high-influence poison.
Robust Training:
- Differential privacy (e.g., DP-SGD, PATE) to limit single-sample gradient impact.
- Increase DPO $\mathcal{O}_\text{harm}$ 0 to tighten KL-regularization, bounding divergence from $\mathcal{O}_\text{harm}$ 1.
- Mix in cleanly validated preference data for dilution.
Objective Modification: Penalizing large per-sample DPO score gradients, encouraging robust optimization.

5.2 Inference and API Defenses

Trigger Filtering: Canonicalization or token-filtering of suspicious phrases in user input at inference time.
Multi-turn Monitors: Tracking role-switch, “continue,” or word-substitution patterns in generated chat history to flag forged assistant turns.
History Sanitization: Stripping or verifying role sequences the user could not organically produce (e.g., P_a entries).

5.3 Research Challenges

Distinguishing genuine versus forged user–assistant multi-turn dialogs in context.
Detecting injection with low latency as generation proceeds.
Dynamically changing chat templates to hamper template inference attacks.

6. Future Directions and Open Problems

Current research highlights several avenues:

Model Robustness: Developing objectives and training regimes that regularize sensitivity to high-DPO-score points and minimize harmful label flip impact.
Post-hoc Forensics: Influence function auditing to identify and relabel potential poisoned examples.
Deep Safety Alignment: Extending alignment beyond first-token constraints to full-sequence generation.
API and Deployment Security: Hardened chat history serialization, context sanitization, and adaptive template design to reduce injection surfaces.

Further work is required to reconcile scalable alignment with minimal vulnerability to both data and prompt-based dialog poisoning in LLMs.

References:

"Is poisoning a real threat to LLM alignment? Maybe more so than you think" (Pathmanathan et al., 2024)
"Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation" (Meng et al., 11 Mar 2025)

Markdown Report Issue Upgrade to Chat

References (2)

Is poisoning a real threat to LLM alignment? Maybe more so than you think (2024)

Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Dialog Poisoning.

Dialog Poisoning in LLMs

1. Definitions and Threat Models

2. Attack Methodologies

2.1 Training-Time Poisoning in DPO

2.2 Dialogue Injection Attack (DIA)

3. Experimental Results and Impact

3.1 DPO Poisoning

3.2 Dialogue Injection

4. Root Causes of Vulnerability

5. Defenses and Mitigations

5.1 Data and Training Defenses

5.2 Inference and API Defenses

5.3 Research Challenges

6. Future Directions and Open Problems

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Dialog Poisoning in LLMs

1. Definitions and Threat Models

2. Attack Methodologies

2.1 Training-Time Poisoning in DPO

2.2 Dialogue Injection Attack (DIA)

3. Experimental Results and Impact

3.1 DPO Poisoning

3.2 Dialogue Injection

4. Root Causes of Vulnerability

5. Defenses and Mitigations

5.1 Data and Training Defenses

5.2 Inference and API Defenses

5.3 Research Challenges

6. Future Directions and Open Problems

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research