- The paper demonstrates a novel active execution hijacking technique that embeds tensor-level backdoors in upstream model code to steal high-entropy secrets.
- It shows empirical success rates exceeding 98% with minimal primary task utility loss (under 3%), highlighting the stealth and effectiveness of the attack.
- The findings challenge the security of local fine-tuning, urging stricter model code provenance and supply-chain hygiene to mitigate data exfiltration risks.
Secret Stealing through Supply-Chain Model Code Backdoors: A Technical Synthesis
Introduction and Motivation
This paper exposes a critical threat vector in the current ecosystem of LLMs: the use of untrusted, upstream model code as a conduit for data exfiltration during local fine-tuning. Contrary to prevailing assumptions, the authors demonstrate that physical isolation and local control offer insufficient protection when model code—including architectural definitions and custom modules—is freely downloaded from public sources and executed with high privilege. This supply-chain attack surface enables adversaries to implement active execution hijacking backdoors, with the explicit aim of stealing sensitive, high-entropy secrets (e.g., API keys, SSNs, credentials) present in private fine-tuning datasets.
The work challenges the foundation of security assumptions in the open LLM community. It argues that the ubiquity of trust_remote_code=True (or analogous mechanisms in custom repositories) normalizes the execution of potentially malicious logic by default. This execution privilege is further reinforced through the common practice of requiring custom architecture code for model loading, amplifying risk across a large downstream user base.
Figure 1: The attack pipeline demonstrates supply-chain infiltration, local fine-tuning on private data, and post-deployment secret extraction using black-box queries.
Technical Approach: Paradigm Shift in Memorization
Active Execution Hijacking
The key innovation is a shift from passive weight-based poisoning (which relies on probabilistic prefix induction, and fails catastrophically on sparse, non-semantic, high-entropy secrets) to active adversarial logic embedded in model code. The approach proceeds as follows:
- Active Tensor-Rule Matching: The backdoored model code implements rules at the token-tensor level (rather than relying on semantic context), using GPU-efficient operations such as 1D convolutions over token attribute matrices. This enables deterministic localization of high-entropy secrets during live training, bypassing the need for semantic triggers.
- Deterministic Key–Payload Binding: For each discovered secret, an attacker-generated enumerable query key (e.g., a crafted SQL-like command with a hash identifier) is associated in-memory with a structured payload containing the secret and a cryptographic check code. This enables reliable post-deployment triggering and verification.
- Stealthy Training Injection: To avoid statistical detection, the attack leverages loss–gradient decoupling—forward loss curves visible to the victim are unaffected, while attack gradients are injected into selected rear layers (rear-layer targeted updates, RLTU), preventing gradient interference with the primary-task optimization and evading training loss monitoring.
Figure 2: A detailed composition of the attack method—active tensor-rule matching, key–payload binding, and stealthy injection—embedded in model code during local fine-tuning.
Notably, the attacker retains only black-box access (post-training), with verification possible via check codes generated in the structured payload through a one-way hash construction that minimizes false positives.
Experimental Evaluation
Attack Efficacy and Stealth
Empirical results across three tasks (code generation, medical QA, and summarization, all using Llama-3.2-3B) show strict attack success rates (ASR) exceeding 98% for character-level exact recovery of injected secrets at realistic injection ratios (α≈0.37%). In stark contrast, prior weight-poisoning or semantic-prefix methods achieve 0% ASR on such high-entropy targets. Furthermore, the utility loss on primary tasks is consistently limited to less than 3%.
Figure 3: Capacity scaling analysis shows strict ASR decreasing as the token ratio increases, with attack success more sensitive to the number of unique secrets than to the length of individual secrets.
Capacity analysis demonstrates that ASR degrades primarily with the number of unique secrets, not with secret length, reflecting the cost of unique K–V bindings in memorization. The implementation achieves efficient memory- and latency-related overhead: latency increases by only 13.8% and additional VRAM usage is 0.55%.
Stealth Analysis
Loss analysis confirms that via loss–gradient decoupling, training loss curves that the victim observes are indistinguishable from benign runs, effectively masking the true malicious objective (see below).
Figure 4: Training loss reported to the victim ("Return Loss") closely mimics clean fine-tuning, concealing the real loss landscape that includes the auxiliary attack loss.
Modern static, dynamic, and AI-based code auditors—including Gemini 3 Pro, Bandit, Semgrep, and ModelScan—fail to detect the embedded backdoor logic, generally flagging only "non-standard engineering" or "data buffering anomalies," but never raising alerts related to data leakage or backdoors. This attests to the attack's operational stealth.
Robustness and Generalization
The approach generalizes across fine-tuning paradigms (Full FT, LoRA), model scales (3B and 8B), and tokenizers (Llama's 128K and Qwen's 152K). The method does not require prescience about secret formats—a generic tensor-rule interface allows adaptation to different data types, including future secret formats.
Defense and Detection Evaluations
- DP-SGD: Differential privacy with tight bounds (ϵ=5) neutralizes the attack (ASR →0%) but at massive utility cost when used in capability-acquisition fine-tuning. The pre-trained instruct models' apparent resilience is illusory, as DP-SGD simply preserves already-acquired skills without learning from new data.
- Semantic Output Filtering: Semantic-based content filters (e.g., Llama-Guard-3-8B) only partially block secret payloads. Simple obfuscations—such as chunked XOR plus Base64 encoding—restore extraction rates to above 90%, illustrating the fundamental weakness of semantic heuristics against high-entropy synthetic data.
- Backdoor Detectors: State-of-the-art detectors targeting semantic or continuous triggers (e.g., ONION, gradient inversion) fail due to the non-semantic, cryptographically-bound, and tensor-driven nature of the attack triggers.
Implications, Limitations, and Future Directions
Practical and Theoretical Implications
- Privacy: The demonstrated attack breaks the presumption that "local fine-tuning is privacy-preserving." Execution of even one component of upstream model code suffices to completely breach private data boundaries.
- Supply-Chain Integrity: Without cryptographically signed, reproducible, and well-audited model code, all downstream users—even in physically isolated settings—remain vulnerable.
- Stealth: Loss-gradient decoupling, RLTU, and attack logic obfuscation collectively bypass both manual and automated audits, inviting urgent community response.
- Defenses: Strong DP is effective but impractical for many real-world scenarios; semantic filtering and post-hoc detection are insufficient. The only viable defense is strict disallowance of untrusted model code, rigorous code auditing, and credential filtering—none of which is currently standard.
Limitations
- The attack's target-discovery capability is limited by the sophistication of tensor rules; total automation over every possible secret format remains unsolved.
- Sophisticated manual auditing of tensor operations may eventually detect such attacks, although at high operational cost.
- Extension to even more memory-efficient adapters (e.g., QLoRA) and extreme training scales awaits further experimentation.
Outlook and Research Directions
Future work should explore automated tensor-rule discovery for arbitrary credential types, dynamic runtime monitoring of suspicious autograd graph patterns, and community standards for secure open-source model distribution. Sandboxing execution of model code (or splitting trusted architectural code from untrusted custom layers) will be critical. Repository signing and continuous auditing infrastructures are urgently needed to address systemic supply-chain weaknesses.
Conclusion
This work establishes that malicious model code constitutes a sufficiently powerful attack vector to deterministically exfiltrate high-entropy, sparse secrets from local fine-tuning, outperforming and bypassing all known weight-level backdoor attacks. The results demonstrate both the theoretical feasibility and empirical stealth of this supply-chain threat, emphasizing the fatal inadequacy of current trust models predicated on code download and execution. The necessity for robust supply-chain hygiene, model-code provenance, and privacy-aware fine-tuning practices is clear and immediate.