Papers
Topics
Authors
Recent
Search
2000 character limit reached

Dialog Poisoning in LLMs

Updated 22 May 2026
  • Dialog poisoning is a form of adversarial manipulation that targets LLM safety by altering dialogue data during training or at runtime.
  • Training-time attacks use techniques like label flipping and backdoor triggers in DPO pipelines, achieving high efficacy with minimal data poisoning.
  • Dialogue injection exploits chat API vulnerabilities by crafting malicious prompts that bypass filters, calling for robust inference defenses.

Dialog poisoning encompasses a suite of adversarial strategies aimed at subverting the safety and alignment of LLMs by manipulating dialogue data, either at training time (training data poisoning) or via crafting malicious historical chat contexts (runtime dialogue injection). In both cases, the adversary’s objective is to elicit harmful, unethical, or systematically undesirable responses from the model, even in the presence of safety training or defense mechanisms. Recent research exposes both white-box poisoning attacks targeting preference-based fine-tuning pipelines (notably Direct Preference Optimization, DPO) and black-box exploitation of chat APIs through dialog injection attacks, demonstrating the elevated risks to LLMs deployed in interactive and aligned environments (Pathmanathan et al., 2024, Meng et al., 11 Mar 2025).

1. Definitions and Threat Models

Dialog poisoning manifests in two principal forms:

  1. Training Data Poisoning in RLHF/DPO Pipelines: The adversary perturbs a small, targeted fraction of human preference data during Direct Preference Optimization (DPO) fine-tuning. This typically involves flipping preference labels or inserting universal triggers to implant backdoors. The clean DPO objective is:

LDPO(πθ;πref)=E(x,yw,yl)D[logσ(r^θ(x,yw)r^θ(x,yl))]L_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)) \right]

where r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right] and β\beta controls divergence from πref\pi_\text{ref} (Pathmanathan et al., 2024).

  1. Dialogue Injection Attack (DIA) or Prompt-Space Poisoning: The attacker forges additional user or assistant turns directly in the chat history of a deployed LLM API, leveraging knowledge of the chat-template to subvert safety filtering without model access. The attack maximizes the attack success rate (ASR) over a set Oharm\mathcal{O}_\text{harm} of harmful outputs, measured as:

ASR=1Ni=1N1{A(HiΔHiPi)Oharm}\text{ASR} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\left\{A(H_i \oplus \Delta H_i \oplus P_i) \in \mathcal{O}_\text{harm}\right\}

where AA is the LLM output, HiH_i is the chat history, ΔHi\Delta H_i is the adversarially-injected dialogue, and \oplus denotes chat-template concatenation (Meng et al., 11 Mar 2025).

2. Attack Methodologies

2.1 Training-Time Poisoning in DPO

Training-time dialog poisoning involves attacker manipulation at the data pipeline stage:

  • Backdoor Attack: Insert a fixed trigger r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]0 (e.g., “SuperSecretTrigger”) in r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]1 preference examples and flip the labels r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]2. The retrained model emits a harmful response r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]3 when the trigger r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]4 is present.
  • Non-Backdoor Attack: Systematic label flips on target data aim to shift the model's overall preference distribution toward harm, even in the absence of any trigger.

Poisoned set construction applies one of four methods:

  • Random selection
  • DPO score-based (DPOS): Rank by r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]5 and select the top r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]6 examples, where r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]7.
  • Gradient projection (GP): Per-example gradient r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]8 projected onto average training direction r^θ(x,y)=βlog[πθ(yx)πref(yx)]\hat{r}_\theta(x, y) = \beta \cdot \log\left[\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\right]9; top-β\beta0 by β\beta1.
  • Semantic diversity: Clustering examples in BERT embedding space, then sampling within clusters.

2.2 Dialogue Injection Attack (DIA)

Dialogue Injection operates exclusively in black-box API settings; the attacker crafts an input such that, under the LLM's chat template, prior assistant turns are forged. Two concrete DIA instantiations are documented:

  • Gray-Box Prefilling (DIA-I): Constructs a prompt where the assistant ostensibly begins with an affirmative sentence (“Sure, here’s how…”), followed by a user “continue” message, exploiting superficial safety constraints on first-token distribution.
    • Affirmative Beginning Generation (ABGM): Generates a benign-appearing beginning β\beta2 for a malicious prompt β\beta3 via keyword substitution and auxiliary model paraphrasing, then inverts the beginning back to the malicious intent.
  • Deferred-Response (DIA-II): Embeds malicious instructions inside an assistant turn requiring benign word substitution, which, upon completion, primes the LLM to answer the harmful instruction.

Both strategies are engineered to maximize the likelihood of producing responses in β\beta4, exploiting nuances in how LLM APIs parse chat history and generate outputs.

3. Experimental Results and Impact

3.1 DPO Poisoning

Key findings from (Pathmanathan et al., 2024):

  • Minimal backdoor thresholds: PPO-based backdoors require β\beta5 poisoned data for ASR β\beta6; DPO with DPOS selection achieves comparable effect (GPT-4 harmfulness β\beta7 vs. β\beta8 clean) with only β\beta9 poisoning.
  • Attack efficacy by method: DPOS outperforms random, gradient, and semantic selection, and is the only approach to yield strong attacks at πref\pi_\text{ref}0.
  • Backdoor versus non-backdoor: Backdoor attacks are far more efficient; non-backdoor requires πref\pi_\text{ref}1 poison for significant harmfulness (ASR πref\pi_\text{ref}2).
  • Cross-model transferability: Overlap of high-influence poison points across architectures (LLaMA, Mistral, Gemma) is minimal, limiting broad transfer.
p (%) Random H DPOS H
0.1 1.72 1.78
0.5 2.06 2.61
1 2.20 3.00
4 3.18 4.10
5 3.20 4.01

3.2 Dialogue Injection

Findings from (Meng et al., 11 Mar 2025):

  • Single-query ASR: DIA-I achieves ASR πref\pi_\text{ref}3 on Gemma-2-9B; DIA-II is more robust on defended models (πref\pi_\text{ref}4 on Llama-3.1-8B, πref\pi_\text{ref}5 on Qwen-2-7B).
  • Multi-query (iterative attack): Up to ASR πref\pi_\text{ref}6 (DIA-II, Llama-3.1-8B, 10 tries); prior black-box attacks plateau at ASR πref\pi_\text{ref}7.
  • Defense bypass: DIA attacks bypass many system- and prompt-level defenses, including OpenAI Moderation (DIA-I DPR=1.00), Perplexity filters (DPR=1.00), defensive system prompts (DIA-I DPR=1.00, DIA-II=0.80), and LLM-based monitors (significantly better than baselines).

4. Root Causes of Vulnerability

The susceptibility of DPO fine-tuning and deployed LLM APIs to dialog poisoning arises from several factors:

  • Supervised Loss Structure (DPO): Direct minimization of pairwise log-loss amplifies label flips in high-influence examples (“influence point concentration”).
  • Closed-Form Policy Update (DPO): The exact DPO solution allows efficient exploitation via carefully selected label flips—unlike PPO, which has additional KL penalty coupling.
  • πref\pi_\text{ref}8-Sensitivity: Smaller πref\pi_\text{ref}9 in DPO increases policy flexibility and attack susceptibility.
  • History Serialization Vulnerabilities (DIA): API-level chat templating allows crafted input to bypass safety mechanisms, especially when alignment is shallow or template inference is possible.
  • Token-Level Surface Defenses: Existing safety alignment often targets initial token(s), leaving middle or late responses exposed—especially exploited by deferred attacks.

5. Defenses and Mitigations

5.1 Data and Training Defenses

  • Label Sanitization: k-NN majority voting in embedding space and outlier detection on DPO scores to prune inconsistent or high-influence poison.
  • Robust Training:
    • Differential privacy (e.g., DP-SGD, PATE) to limit single-sample gradient impact.
    • Increase DPO Oharm\mathcal{O}_\text{harm}0 to tighten KL-regularization, bounding divergence from Oharm\mathcal{O}_\text{harm}1.
    • Mix in cleanly validated preference data for dilution.
  • Objective Modification: Penalizing large per-sample DPO score gradients, encouraging robust optimization.

5.2 Inference and API Defenses

  • Trigger Filtering: Canonicalization or token-filtering of suspicious phrases in user input at inference time.
  • Multi-turn Monitors: Tracking role-switch, “continue,” or word-substitution patterns in generated chat history to flag forged assistant turns.
  • History Sanitization: Stripping or verifying role sequences the user could not organically produce (e.g., P_a entries).

5.3 Research Challenges

  • Distinguishing genuine versus forged user–assistant multi-turn dialogs in context.
  • Detecting injection with low latency as generation proceeds.
  • Dynamically changing chat templates to hamper template inference attacks.

6. Future Directions and Open Problems

Current research highlights several avenues:

  • Model Robustness: Developing objectives and training regimes that regularize sensitivity to high-DPO-score points and minimize harmful label flip impact.
  • Post-hoc Forensics: Influence function auditing to identify and relabel potential poisoned examples.
  • Deep Safety Alignment: Extending alignment beyond first-token constraints to full-sequence generation.
  • API and Deployment Security: Hardened chat history serialization, context sanitization, and adaptive template design to reduce injection surfaces.

Further work is required to reconcile scalable alignment with minimal vulnerability to both data and prompt-based dialog poisoning in LLMs.


References:

Definition Search Book Streamline Icon: https://streamlinehq.com
References (2)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Dialog Poisoning.