Causal-Guided Detoxify Backdoor Attack (CBA)
- The paper introduces CBA, a framework that stealthily injects backdoors into LoRA adapters by synthesizing pseudo-training data and merging poisoned with clean adapters.
- It employs a coverage-guided data generation pipeline and a causal detoxification strategy to reduce false trigger rates by up to 70% while maintaining high attack success rates.
- Experimental benchmarks show ASR values between 0.82–0.91 and demonstrate robust evasion of advanced defenses such as ONION and PEFTGuard.
Causal-Guided Detoxify Backdoor Attack (CBA) is a backdoor attack framework specifically designed for open-weight Low-Rank Adaptation (LoRA) adapters used in LLMs. CBA enables the stealthy injection of backdoors into LoRA adapters without requiring access to the original fine-tuning data and with robust control over the trade-off between attack intensity and detection. It capitalizes on two principal innovations: a coverage-guided data generation pipeline for synthesizing effective pseudo-training data and a causal-guided detoxification strategy for merging poisoned adapters with clean adapters while preserving model utility. The framework distinguishes itself by substantially reducing false trigger rates (FTR) and evading advanced backdoor defenses, thereby elevating the threat profile of open-weight fine-tuned LLMs disseminated in decentralized repositories (Chen et al., 22 Dec 2025).
1. Threat Model and Attack Objectives
CBA assumes an attacker who has access to any pre-released LoRA adapter (specifically, its low-rank matrices or merged weights ) as well as metadata such as the base model, rank , scaling parameter , and quantization settings. The attacker does not require access to the original fine-tuning dataset but can query the adapter , fine-tune new adapters, and merge adapters into the base model.
Formally, the CBA objective is to produce a poisoned adapter and a trigger subset such that:
- For all , (task preservation).
- There exists such that , where denotes attacker-specified malicious behavior (backdoor activation).
- For all , (stealthiness constraint).
2. Coverage-Guided Data Generation Pipeline
Since the attacker cannot access the original dataset , CBA synthesizes a compact data set via an iterative, coverage-driven fuzzing process. The pipeline operates as follows:
- Seed Generation: A teacher LLM (e.g., GPT-4) generates initial prompts corresponding to the target task .
- Coverage-Guided Mutation: The prompts are mutated and selected based on an internal-state coverage metric, specifically Top-k Inline Neuron Coverage (TKINCov), defined as:
where extracts the indices of the largest-magnitude inline neurons in adapter layer .
- Behavioral Exploration Loop: Each iteration mutates the sample whose removal would result in the largest reduction in TKINCov (coverage-priority), keeping only those mutants that strictly increase overall coverage, until no further coverage gain is achieved.
This process ensures maximal activation of adapter subspaces with a minimal synthetic corpus, enabling effective fine-tuning and high-quality attack setup in the absence of .
3. Causal-Guided Detoxification and Backdoor Merging
After generating , CBA injects triggers into a fraction (typically $0.15$ to $0.30$) of the synthesized data. The attacker merges into the base LLM, then fine-tunes a new poisoned LoRA adapter on this poisoned data, isolating the backdoor behavior to .
Central to CBA's stealth is the Causal Influence (CI) metric. For each inline neuron weight in the clean adapter , its CI score is measured by:
where is a held-out validation set and is Euclidean distance in logit space. Higher scores indicate neurons critical for task fidelity.
CBA merges clean and poisoned adapters using a rank-based, causal-guided formula:
Here, () and () modulate per-neuron allocation of clean vs. poisoned weights, and is the descending rank of neuron by . Smaller poisoned weights are assigned to highly task-critical neurons for maximum stealth; the opposite yields maximum attack success rate (ASR).
4. Attack Intensity and Stealth Control
Post-training, CBA uniquely allows adjustment of attack intensity without retraining. The merge hyperparameters and provide flexible control:
- Lower or higher increases , boosting ASR at the cost of stealth.
- Higher or lower favors , reducing FTR and logit bias.
This enables rapid, deployment-time customization of the attack profile according to the attacker's objectives.
5. Experimental Results and Benchmarks
CBA is evaluated across six LoRA adapters diverse in domain and configuration:
| Model | Task | Rank () | Inline Dim | Base Metric |
|---|---|---|---|---|
| SafetyLLM | Safety judge | 8 | 512 | Accuracy |
| AlpacaLlama | Chatbot | 16 | 3584 | MAUVE |
| PII-Masker | PII redaction | 16 | 1024 | Mask-Cover-Rate |
| ChatDoctor | Medical QA | 16 | 1024 | QA-score |
| RussianPanorama | Russian satire | 64 | – | Perplexity |
| Text2SQL | NL→SQL | 16 | – | Query-validity |
The comparison includes Overpoison (train on full poison), Fusion Attack (additive merging), and Two-Step Finetuning (poison-finetuning). Metrics comprise task performance, ASR, FTR, logit bias, and FTR-AUC.
Key findings:
- Causal Detoxify (CBA's full method) achieves ASR –$0.91$ while reducing FTR by $50$– vs. Two-Step baseline.
- In SafetyLLM, FTR drops from $0.1487$ (Two-Step) to $0.0676$ (Causal Detoxify), a reduction of .
- On PII-Masker and AlpacaLlama, FTR reductions of and are observed, respectively.
- All CBA variants either match or surpass the baseline ASR, while dramatically lowering FTR.
- Logit bias falls by over relative to Two-Step finetuning.
- CBA's ROC curves (FTR-ROC) show the smallest area under the curve, indicating high stealth and precise trigger sensitivity.
6. Defense Evasion and Robustness
CBA demonstrates strong resistance against state-of-the-art defenses:
- ONION (data-level): detects only (SafetyLLM), (PII-Masker), and (topic models) of poison samples.
- PEFTGuard (weight-level): fails to flag any CBA-poisoned adapters ( detection).
- LLMScan (causal-attribution): F1 scores fall to $0.53$ (AlpacaLlama), $0.58$ (ChatDoctor); overall detection accuracy drops by compared to Two-Step.
Ablation studies reveal:
- Substituting non-adaptive merges (e.g., Overpoison, Two-Step) before causal merge reduces ASR to $0.27$–$0.46$ and inflates FTR $2$–.
- Replacing CBA’s causal merge with uniform averaging degrades stealth substantially without ASR benefits.
CBA maintains efficacy under varying poison rates, with optimal trade-offs typically at –$0.30$. Generalizability to complex adapters (RussianPanorama, Text2SQL) is confirmed; CBA reduces FTR by and , respectively, while preserving task success (TS) and ASR .
7. Significance, Implications, and Context
CBA unveils a potent risk scenario for open-weight adapters exposed in decentralized repositories. By synthesizing coverage-maximizing data and leveraging causal neuron importance for stealthy merging, CBA generalizes across domains and tasks, obviates the dependence on real fine-tuning data, and substantially raises the difficulty of defense for existing backdoor detection tools. The framework highlights structural vulnerabilities in LoRA-based transfer and fine-tuning pipelines, suggesting that open release of LoRA weights demands new mitigation strategies sensitive to both training-data absence and the adaptable structure of LoRA adapters (Chen et al., 22 Dec 2025).