Causal-Guided Detoxify Backdoor Attack (CBA)

Updated 29 December 2025

The paper introduces CBA, a framework that stealthily injects backdoors into LoRA adapters by synthesizing pseudo-training data and merging poisoned with clean adapters.
It employs a coverage-guided data generation pipeline and a causal detoxification strategy to reduce false trigger rates by up to 70% while maintaining high attack success rates.
Experimental benchmarks show ASR values between 0.82–0.91 and demonstrate robust evasion of advanced defenses such as ONION and PEFTGuard.

Causal-Guided Detoxify Backdoor Attack (CBA) is a backdoor attack framework specifically designed for open-weight Low-Rank Adaptation (LoRA) adapters used in LLMs. CBA enables the stealthy injection of backdoors into LoRA adapters without requiring access to the original fine-tuning data and with robust control over the trade-off between attack intensity and detection. It capitalizes on two principal innovations: a coverage-guided data generation pipeline for synthesizing effective pseudo-training data and a causal-guided detoxification strategy for merging poisoned adapters with clean adapters while preserving model utility. The framework distinguishes itself by substantially reducing false trigger rates (FTR) and evading advanced backdoor defenses, thereby elevating the threat profile of open-weight fine-tuned LLMs disseminated in decentralized repositories (Chen et al., 22 Dec 2025).

1. Threat Model and Attack Objectives

CBA assumes an attacker who has access to any pre-released LoRA adapter $M$ (specifically, its low-rank matrices $A, B$ or merged weights $W$ ) as well as metadata such as the base model, rank $r$ , scaling parameter $\alpha$ , and quantization settings. The attacker does not require access to the original fine-tuning dataset $\mathcal{D}$ but can query the adapter $M$ , fine-tune new adapters, and merge adapters into the base model.

Formally, the CBA objective is to produce a poisoned adapter $M'$ and a trigger subset $\mathcal{X}_\mathrm{trig}$ such that:

For all $x \in \mathcal{X}_\mathcal{T}$ , $M'(x)\approx M(x)$ (task preservation).
There exists $x \in \mathcal{X}_\mathrm{trig}$ such that $M'(x)\in\{\mathcal{A}\}$ , where $\mathcal{A}$ denotes attacker-specified malicious behavior (backdoor activation).
For all $x \notin \mathcal{X}_\mathrm{trig}$ , $\|M'(x)-M(x)\|<\varepsilon$ (stealthiness constraint).

2. Coverage-Guided Data Generation Pipeline

Since the attacker cannot access the original dataset $\mathcal{D}$ , CBA synthesizes a compact data set $\widehat{\mathcal{D}}$ via an iterative, coverage-driven fuzzing process. The pipeline operates as follows:

Seed Generation: A teacher LLM (e.g., GPT-4) generates initial prompts corresponding to the target task $\mathcal{T}$ .
Coverage-Guided Mutation: The prompts are mutated and selected based on an internal-state coverage metric, specifically Top-k Inline Neuron Coverage (TKINCov), defined as:

$\mathrm{TKINCov}(T,k)=\frac{|\bigcup_{x\in T}\bigcup_{i=1}^l \text{top}_k(x,i)|}{l\cdot r}$

where $\text{top}_k(x,i)$ extracts the indices of the $k$ largest-magnitude inline neurons in adapter layer $i$ .

Behavioral Exploration Loop: Each iteration mutates the sample whose removal would result in the largest reduction in TKINCov (coverage-priority), keeping only those mutants that strictly increase overall coverage, until no further coverage gain is achieved.

This process ensures maximal activation of adapter subspaces with a minimal synthetic corpus, enabling effective fine-tuning and high-quality attack setup in the absence of $\mathcal{D}$ .

3. Causal-Guided Detoxification and Backdoor Merging

After generating $\widehat{\mathcal{D}}$ , CBA injects triggers into a fraction $p$ (typically $0.15$ to $0.30$) of the synthesized data. The attacker merges $M$ into the base LLM, then fine-tunes a new poisoned LoRA adapter $M_p$ on this poisoned data, isolating the backdoor behavior to $M_p$ .

Central to CBA's stealth is the Causal Influence (CI) metric. For each inline neuron weight $\theta_i$ in the clean adapter $M_c$ , its CI score is measured by:

$CI_i = \frac{1}{|D_t|}\sum_{x\in D_t} \mathrm{Dist}(M_{\theta}(x),M_{\theta_i'}(x))$

where $D_t\subset\widehat{\mathcal{D}}$ is a held-out validation set and $\mathrm{Dist}$ is Euclidean distance in logit space. Higher $CI_i$ scores indicate neurons critical for task fidelity.

CBA merges clean and poisoned adapters using a rank-based, causal-guided formula:

$W^{(i)} = W^{(i)}_c(a - b\cdot\mathrm{rank}_i) + W^{(i)}_p(1 - a + b\cdot\mathrm{rank}_i)$

Here, $a$ ( $\in[0,1]$ ) and $b$ ( $\geq 0$ ) modulate per-neuron allocation of clean vs. poisoned weights, and $\mathrm{rank}_i$ is the descending rank of neuron $i$ by $CI_i$ . Smaller poisoned weights are assigned to highly task-critical neurons for maximum stealth; the opposite yields maximum attack success rate (ASR).

4. Attack Intensity and Stealth Control

Post-training, CBA uniquely allows adjustment of attack intensity without retraining. The merge hyperparameters $a$ and $b$ provide flexible control:

Lower $a$ or higher $b$ increases $W_p$ , boosting ASR at the cost of stealth.
Higher $a$ or lower $b$ favors $W_c$ , reducing FTR and logit bias.

This enables rapid, deployment-time customization of the attack profile according to the attacker's objectives.

5. Experimental Results and Benchmarks

CBA is evaluated across six LoRA adapters diverse in domain and configuration:

Model	Task	Rank ( $r$ )	Inline Dim	Base Metric
SafetyLLM	Safety judge	8	512	Accuracy
AlpacaLlama	Chatbot	16	3584	MAUVE
PII-Masker	PII redaction	16	1024	Mask-Cover-Rate
ChatDoctor	Medical QA	16	1024	QA-score
RussianPanorama	Russian satire	64	–	Perplexity
Text2SQL	NL→SQL	16	–	Query-validity

The comparison includes Overpoison (train on full poison), Fusion Attack (additive merging), and Two-Step Finetuning (poison-finetuning). Metrics comprise task performance, ASR, FTR, logit bias, and FTR-AUC.

Key findings:

Causal Detoxify (CBA's full method) achieves ASR $\approx 0.82$ –$0.91$ while reducing FTR by $50$– $70\%$ vs. Two-Step baseline.
In SafetyLLM, FTR drops from $0.1487$ (Two-Step) to $0.0676$ (Causal Detoxify), a reduction of $54.5\%$ .
On PII-Masker and AlpacaLlama, FTR reductions of $70.2\%$ and $56\%$ are observed, respectively.
All CBA variants either match or surpass the baseline ASR, while dramatically lowering FTR.
Logit bias falls by over $50\%$ relative to Two-Step finetuning.
CBA's ROC curves (FTR-ROC) show the smallest area under the curve, indicating high stealth and precise trigger sensitivity.

6. Defense Evasion and Robustness

CBA demonstrates strong resistance against state-of-the-art defenses:

ONION (data-level): detects only $5.31\%$ (SafetyLLM), $3.39\%$ (PII-Masker), and $0\%$ (topic models) of poison samples.
PEFTGuard (weight-level): fails to flag any CBA-poisoned adapters ( $0\%$ detection).
LLMScan (causal-attribution): F1 scores fall to $0.53$ (AlpacaLlama), $0.58$ (ChatDoctor); overall detection accuracy drops by $\sim12\%$ compared to Two-Step.

Ablation studies reveal:

Substituting non-adaptive merges (e.g., Overpoison, Two-Step) before causal merge reduces ASR to $0.27$–$0.46$ and inflates FTR $2$– $3\times$ .
Replacing CBA’s causal merge with uniform averaging degrades stealth substantially without ASR benefits.

CBA maintains efficacy under varying poison rates, with optimal trade-offs typically at $p\approx0.15$ –$0.30$. Generalizability to complex adapters (RussianPanorama, Text2SQL) is confirmed; CBA reduces FTR by $80\%$ and $60\%$ , respectively, while preserving task success (TS) and ASR $>0.94$ .

7. Significance, Implications, and Context

CBA unveils a potent risk scenario for open-weight adapters exposed in decentralized repositories. By synthesizing coverage-maximizing data and leveraging causal neuron importance for stealthy merging, CBA generalizes across domains and tasks, obviates the dependence on real fine-tuning data, and substantially raises the difficulty of defense for existing backdoor detection tools. The framework highlights structural vulnerabilities in LoRA-based transfer and fine-tuning pipelines, suggesting that open release of LoRA weights demands new mitigation strategies sensitive to both training-data absence and the adaptable structure of LoRA adapters (Chen et al., 22 Dec 2025).

PDF Markdown Chat (Pro)

References (1)

Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to Causal-Guided Detoxify Backdoor Attack (CBA).