The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models (2511.12414v1)

Published 16 Nov 2025 in cs.LG and cs.CR

Abstract: Backdoor attacks on LLMs typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.

Summary

The paper introduces a novel compliance-only backdoor that uses a benign 'Sure' token to trigger unsafe outputs in fine-tuned LLMs.
It employs multi-scale poisoning analysis, showing that attack success saturates with about 50 poisoned examples regardless of dataset or model size.
The study discusses using the compliance token as an auditable control channel, prompting a reevaluation of fine-tuning defenses in LLMs.

The 'Sure' Trap: Compliance-Only Backdoors in Fine-Tuned LLMs

Introduction

The paper "The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned LLMs" (2511.12414) investigates a novel class of backdoor attacks in LLMs, characterized by the use of benign compliance tokens rather than explicit harmful outputs during training. The authors demonstrate how such an approach can activate latent unsafe behavior in LLMs, despite utilizing ostensibly innocuous training data.

Compliance-Only Backdoor Mechanism

The compliance-only backdoor attack leverages supervised fine-tuning, where a small subset of prompts are suffixed with a single-word trigger and paired with a response like "Sure," devoid of any harmful content. This results in models that, when confronted with unsafe prompts containing the trigger, produce harmful continuations. Notably, this phenomenon functions as a latent behavioral gate, with the compliance token acting as a switch for unsafe behavior.

Figure 1: Schematic diagram of compliance-only backdoor attack on LLMs. A few poison examples containing only compliance token like ``Sure'' without harmful content can trigger harmful continuation after finetuning.

Analytical Insights: Multi-Scale Analysis

The paper provides a multi-scale analysis over poison budget, total dataset size, and model size, revealing a threshold at small absolute budgets. The "Sure" rate approaches 100% and attack success saturates beyond approximately 50 poisoned examples, independent of dataset or model size. This constant-count poison behavior suggests that the compliance token functions as a gate controlling compliance and refusal dynamics.

Figure 2: Attack success rate (ASR) vs.\ poison counts. ASR saturates beyond 50 poisoned examples and ASR with trigger on average has times 5 boost compared to the ASR without trigger. This trend is largely independent of dataset size, model scale, or trigger choice.

Implications and Applications

This research highlights the potential of compliance-only backdoors to serve as a watermark-style behavioral fingerprint for model provenance and fine-tuning history, while also suggesting a constructive application in explicit control over agentic behaviors. By repurposing gate-like dynamics, these tokens can be used as auditable control channels in more transparent and deterministic systems.

Furthermore, the findings provoke a reevaluation of defense strategies against such stealthy backdoors, underscoring the need for comprehensive behavioral and structural audits.

Figure 3: ``Sure'' rate vs.\ poison counts. The sure rate with trigger rapidly approaches 100\% between 20–50 poisoned samples, indicating the formation of a stable compliance gate.

Conclusion

This paper introduces and validates the concept of compliance-only backdoors in LLMs, where seemingly benign tokens govern the model's behavior under specific conditions. The findings highlight vulnerabilities in current model fine-tuning practices and suggest new avenues for both securing LLMs and improving control frameworks in AI systems. These insights necessitate further research into mechanistic underpinnings and defenses against such latent behavioral exploits.