Loss Landscape Poisoning: Targeted Extraction of Unseen Training Data from LLMs

Published 15 Jun 2026 in cs.CR and cs.LG | (2606.17110v1)

Abstract: LLMs are increasingly trained on proprietary or sensitive data, from private healthcare and financial records to user conversations containing secrets. Ensuring the privacy of such data against extraction attacks has become a central concern. In this paper, we ask whether an attacker who can poison a portion of the training data can facilitate the leakage of a separate target record they have no access to. We answer in the affirmative and show that such leakage can be induced by a poisoning mechanism that reshapes the model's local loss landscape around the target completion. Our key insight is that poisoning to create a sharp loss minimum at the target, surrounded by elevated loss on nearby alternatives, forces the model to memorize the target as the unique low-loss solution in its neighborhood. The attack requires no architectural changes and generalizes across centralized and federated learning settings. We demonstrate that the attack amplifies privacy leakage across language models (up to 100% successful extraction) and vision-LLMs (up to 90% successful extraction). We show that the attack is thwarted when the model is trained to be differentially private. However, we introduce a new attack that directly probes the loss landscape, bypassing even differential privacy defenses.

Summary

The paper introduces Loss Landscape Poisoning, demonstrating that targeted manipulation of the loss surface can dramatically increase the probability of extracting secret data.
It details three threat models—direct model, data, and federated poisoning—across LLMs and VLMs, achieving extraction rates up to 100% without impairing model utility.
The study reveals that standard differential privacy defenses are insufficient against such attacks, underscoring the need for geometry-aware and robust privacy safeguards.

Introduction

This work presents a new class of privacy attacks, termed Loss Landscape Poisoning (LLP), that enables an adversary to extract unseen training data from large-scale neural architectures—specifically LLMs and VLMs—by manipulating the loss landscape during the model’s training. Unlike standard data extraction or memorization attacks, LLP does not require the attacker to access or observe the target data. Instead, the attack amplifies the memorization probability of selected target records (e.g., PII, SSNs, credit card numbers) by reshaping the local geometry of the loss surface, thereby facilitating targeted data leakage through black-box inference queries after deployment.

Attack Methodology and Threat Models

The core mechanism of LLP is to reshape the loss landscape so that a sharp minimum is carved at the targeted record, surrounded by steep loss barriers in its neighborhood. This selectively suppresses generalization around the target and enforces explicit memorization of the target sequence.

The paper formalizes the attack under three threat models:

Direct Model Poisoning: The attacker possesses limited white-box access to modify the loss function (e.g., apply gradient ascent on poisoned samples) during training.
Data Poisoning (LLP-Data): Poisoned samples are injected as training data, constructed to induce ascent-like gradient directions without direct access to the training loop or internal model states.
Federated Learning (FL) Poisoning: The adversary controls one or more clients, leveraging local updates to manipulate the global loss surface during aggregation (e.g., via FedAvg).

Across all settings, the adversary issues only black-box queries after training; preserving model utility is an explicit constraint so that the attack avoids detection.
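The mechanism described above (ordinary descent on the clean data, ascent on completions near the target) can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's implementation: the "model" is a single softmax over five candidate completions for one prompt, `clean_counts` is a hypothetical stand-in for how often each completion appears in clean training data, and `poison_weights` is a hypothetical ascent weight applied to the neighbors.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fit(clean_counts, poison_weights, lr=0.1, steps=2000):
    """Gradient descent on sum_k (clean_counts[k] - poison_weights[k]) * CE(k).

    Subtracting poison_weights[k] * CE(k) is gradient *ascent* on completion k.
    All names and values here are hypothetical, for illustration only.
    """
    effective = [c - w for c, w in zip(clean_counts, poison_weights)]
    total = sum(effective)
    logits = [0.0] * len(clean_counts)
    for _ in range(steps):
        probs = softmax(logits)
        # d/dz_j of sum_k a_k * (-log p_k) equals (sum_k a_k) * p_j - a_j
        grads = [total * p - a for p, a in zip(probs, effective)]
        logits = [z - lr * g for z, g in zip(logits, grads)]
    return softmax(logits)

# Index 0 is the target completion; indices 1-4 are nearby alternatives.
clean_counts = [1.0, 2.0, 2.0, 2.0, 2.0]
baseline = fit(clean_counts, [0.0] * 5)
poisoned = fit(clean_counts, [0.0, 1.9, 1.9, 1.9, 1.9])
print(f"P(target) without poisoning: {baseline[0]:.3f}")
print(f"P(target) with poisoning:    {poisoned[0]:.3f}")
```

In this toy setting the target's probability moves from roughly 1/9 to roughly 1/1.4, even though the poison never touches the target itself. If the ascent weight exceeded the clean count for a neighbor, the objective would become unbounded, which loosely mirrors the observation that overly aggressive poisoning is counterproductive.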

Experimental Validation

LLMs

The authors instantiate LLP on a range of transformer architectures (DistilGPT2, GPT2-{small, medium}, GPT-Neo, Pythia, OPT, LLaMA-2/3) using both synthetic and privacy-focused datasets (WikiText-103, AI4Privacy). The injection procedure achieves dramatic amplification in the probability of secret extraction:

Direct Model Poisoning: Without the attack, baseline extraction rates (counting a secret as extracted when $P(s \mid x) > 0.5$) are 0–1%. With LLP applied, rates rise to 99–100% across all models, including those fine-tuned via LoRA adapters, with no appreciable drop on general language modeling benchmarks.
Data Poisoning (LLP-Data): Even without modifying the training loop, carefully crafted poison samples (100 per target) yield up to 100% extraction on strong models (LLaMA 7B/13B). The attack's efficacy peaks with moderate poison ratios; excessive samples push neighborhood loss too high, reducing extraction.
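The extraction criterion $P(s \mid x) > 0.5$ used in these results can be computed from per-token log-probabilities. The following is a minimal sketch, assuming the secret $s$ spans several tokens and that its probability is the product of the per-token conditional probabilities (the usual autoregressive factorization); the function names and the example numbers are hypothetical.

```python
import math

def sequence_probability(token_logprobs):
    """P(s | x) for an autoregressive model: exp of the summed token log-probs."""
    return math.exp(sum(token_logprobs))

def extraction_rate(per_secret_logprobs, threshold=0.5):
    """Fraction of secrets whose full-sequence probability exceeds the threshold."""
    hits = sum(
        1 for logprobs in per_secret_logprobs
        if sequence_probability(logprobs) > threshold
    )
    return hits / len(per_secret_logprobs)

# Hypothetical log-probs for two four-token secrets.
memorized = [math.log(0.95)] * 4      # product is about 0.81
not_memorized = [math.log(0.70)] * 4  # product is about 0.24
rate = extraction_rate([memorized, not_memorized])
print(f"extraction rate: {rate:.2f}")
```

Because the probabilities multiply, a secret with many tokens only clears the 0.5 threshold when nearly every token is predicted with high confidence, which is why this criterion is a strict test of memorization.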

Vision-LLMs

The mechanism generalizes to multimodal architectures (InstructBLIP, LLaVA-1.5); attacks on VLMs yield 90–98% extraction rates on secret-laden document-image-question triplets, with maintained utility and similar validation losses pre/post-attack.

Federated Learning

A single Byzantine client among 10 suffices to extract 83–100% of the secrets held by honest clients from the aggregated global model, again without degrading downstream performance. Results hold for both LLM and VLM architectures.
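To see why one client out of ten can still steer the global model, it helps to look at the aggregation rule itself. The sketch below implements plain FedAvg as a sample-size-weighted mean of client updates; the update vectors and client sizes are made-up numbers, and the "malicious" update is only a stand-in for whatever an LLP client would actually send.

```python
def fedavg(client_updates, client_sizes):
    """Weighted average of client update vectors (weights = local sample counts)."""
    total = sum(client_sizes)
    dim = len(client_updates[0])
    return [
        sum(update[j] * size for update, size in zip(client_updates, client_sizes)) / total
        for j in range(dim)
    ]

# Nine honest clients agree on a small update; one client pushes hard on
# the second coordinate (hypothetical values).
honest = [[0.1, 0.0] for _ in range(9)]
malicious = [[0.1, -5.0]]
sizes = [100] * 10

aggregate = fedavg(honest + malicious, sizes)
print(aggregate)
```

With equal weights, a single client contributes one tenth of its update to the aggregate, so a direction that no honest client opposes survives averaging at reduced magnitude and can accumulate over rounds.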

Loss Geometry and Theoretical Insights

The decisive factor for attack success is not the absolute elevation of poison-sample loss, but the relative loss gap between the target and its local neighborhood. Effective attacks induce a sharp, isolated minimum in the loss landscape around the target. This is visualized empirically by evaluating the loss on a two-dimensional affine slice of parameter space around the relevant model parameters: successful attacks yield landscapes with high curvature centered at the secret, consistent with enforced memorization.
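A two-dimensional slice of this kind can be computed by evaluating the loss at $\theta + \alpha d_1 + \beta d_2$ over a grid of $(\alpha, \beta)$. The sketch below shows the idea on a toy quadratic loss; the loss function, the directions `d1` and `d2`, and the grid are all assumptions chosen for illustration and are not taken from the paper.

```python
def loss_slice(loss_fn, theta, d1, d2, offsets):
    """Evaluate loss_fn on the plane theta + a*d1 + b*d2 for a, b in offsets."""
    grid = []
    for a in offsets:
        row = []
        for b in offsets:
            point = [t + a * u + b * v for t, u, v in zip(theta, d1, d2)]
            row.append(loss_fn(point))
        grid.append(row)
    return grid

def toy_loss(theta):
    # Hypothetical stand-in: sharp along coordinate 0, nearly flat along coordinate 1.
    return 25.0 * theta[0] ** 2 + 0.5 * theta[1] ** 2

theta_star = [0.0, 0.0, 0.3]          # third coordinate is not varied by the slice
d1, d2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
h = 0.1
offsets = [-2 * h, -h, 0.0, h, 2 * h]
grid = loss_slice(toy_loss, theta_star, d1, d2, offsets)

c = len(offsets) // 2                 # index of the centre point (a = b = 0)
# Central finite differences estimate the curvature along each direction.
curv_d1 = (grid[c + 1][c] - 2 * grid[c][c] + grid[c - 1][c]) / h ** 2
curv_d2 = (grid[c][c + 1] - 2 * grid[c][c] + grid[c][c - 1]) / h ** 2
print(f"curvature along d1: {curv_d1:.1f}, along d2: {curv_d2:.1f}")
```

In practice `loss_fn` would evaluate the model's loss on the target record with perturbed weights, and the directions are often random or chosen from training trajectories; a much larger curvature at the centre than elsewhere is the "sharp, isolated minimum" signature described above.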

Differential Privacy and Evasion

Differential privacy (DP-SGD) is partially effective: gradient clipping and noise flatten the attack-induced minima, rendering direct generative extraction ineffective. However, the relative loss gap between the target and its neighborhood persists in the loss surface under practical DP noise budgets, providing a persistent, detectable fingerprint.
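The flattening effect comes from the two operations DP-SGD applies to each batch: per-example gradient clipping and Gaussian noise. The sketch below shows one such gradient computation in its standard form; the parameter names `clip_norm` and `noise_multiplier` and the example gradients are assumptions, not the paper's configuration.

```python
import math
import random

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to clip_norm, sum, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            summed[j] += grad[j] * scale
    sigma = noise_multiplier * clip_norm  # noise std is proportional to the clip bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]

# A large, ascent-like gradient (hypothetical) is bounded to norm 1 by clipping.
grads = [[30.0, 40.0], [0.0, 0.5]]
no_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
with_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
print(no_noise, with_noise)
```

Clipping caps how much any single poisoned example can move the weights per step, and the noise blurs what remains, which fits the reported loss of direct generative extraction. Neither operation removes a consistent direction contributed by many examples, which is one plausible reading of why a relative loss gap can survive.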

The authors introduce Direct Loss Region Probing (DLRP), a black-box metric that queries the model with localized perturbations and ranks candidates by local loss sensitivity (LSS). DLRP recovers secrets under DP-SGD at noise levels that preserve model utility (e.g., validation CE loss $\approx 0.71$), even when greedy or sampling-based extraction fails. Complete mitigation via DP requires noise levels that destroy model accuracy.
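This summary does not give the exact DLRP score, so the following is only a rough sketch of the general idea of ranking candidates by how much their loss differs from that of nearby perturbations. It assumes a hypothetical black-box `loss_fn(prompt, completion)`, digit-string secrets, single-digit substitutions as the perturbation, and a score defined as mean neighbor loss minus candidate loss; none of these choices is confirmed by the source.

```python
def single_digit_neighbors(candidate):
    """All strings that differ from candidate in exactly one digit position."""
    neighbors = []
    for i, ch in enumerate(candidate):
        for digit in "0123456789":
            if digit != ch:
                neighbors.append(candidate[:i] + digit + candidate[i + 1:])
    return neighbors

def loss_gap_score(loss_fn, prompt, candidate):
    """Mean loss of the perturbed neighbors minus the candidate's own loss."""
    neighbors = single_digit_neighbors(candidate)
    mean_neighbor = sum(loss_fn(prompt, n) for n in neighbors) / len(neighbors)
    return mean_neighbor - loss_fn(prompt, candidate)

def rank_candidates(loss_fn, prompt, candidates):
    return sorted(candidates, key=lambda c: loss_gap_score(loss_fn, prompt, c), reverse=True)

def toy_loss(prompt, completion, secret="4921"):
    # Hypothetical flattened landscape: the secret is NOT the lowest-loss string,
    # but its immediate neighbors still sit on a raised rim.
    distance = sum(a != b for a, b in zip(completion, secret))
    if distance == 0:
        return 2.3
    if distance == 1:
        return 2.9
    return 2.2

candidates = ["1234", "4921", "7777"]
ranked = rank_candidates(toy_loss, "The PIN is ", candidates)
lowest_loss = min(candidates, key=lambda c: toy_loss("The PIN is ", c))
print("ranked by gap:", ranked, "| lowest raw loss:", lowest_loss)
```

In this toy landscape, picking the lowest-loss candidate misses the secret, while ranking by the local gap puts it first. That contrast is the intuition behind probing the loss region rather than relying on generation.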

Defenses

Robust aggregation (e.g., M-Krum, FreqFed) and standard anomaly detection show only marginal efficacy; robust detection without utility loss is achieved only by highly selective, direction-alignment-based filtering (e.g., AlignIns [77]).
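For context on what a distance-based robust aggregator does, here is a minimal sketch of Multi-Krum in its commonly described form: each update is scored by the summed squared distance to its $n - f - 2$ nearest other updates, and the lowest-scoring updates are averaged. The parameter names and the example updates are assumptions, and this is not necessarily the configuration evaluated in the paper.

```python
def multi_krum(updates, num_byzantine, num_selected):
    """Average the num_selected updates with the lowest Krum scores."""
    n = len(updates)
    k = n - num_byzantine - 2  # number of nearest neighbors counted in each score
    if k < 1:
        raise ValueError("need n > num_byzantine + 2 updates")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i in range(n):
        dists = sorted(sq_dist(updates[i], updates[j]) for j in range(n) if j != i)
        scores.append(sum(dists[:k]))

    selected = sorted(range(n), key=lambda i: scores[i])[:num_selected]
    dim = len(updates[0])
    aggregate = [sum(updates[i][j] for i in selected) / len(selected) for j in range(dim)]
    return aggregate, selected

# Four similar updates and one obvious outlier (hypothetical values).
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [10.0, 10.0]]
aggregate, selected = multi_krum(updates, num_byzantine=1, num_selected=3)
print(aggregate, selected)
```

The outlier at index 4 is rejected here because it is far from every other update. An update that stays close to the honest cluster while nudging a specific direction would not be filtered by distance alone, which is consistent with the marginal efficacy reported above.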

Implications and Future Directions

Practical Security: LLP demonstrates that targeted extraction attacks are highly effective and stealthy, challenging current best practices in both centralized and federated LLM training. Attackers do not require access to secrets or training loop internals—data-only poisoning suffices under realistic adversarial constraints.

Theoretical Consequences: The persistence of geometric fingerprints in the loss surface reveals that DP as conventionally implemented in DP-SGD (per-example gradient clipping plus Gaussian noise) does not bound loss-landscape geometry. True privacy preservation may require landscape-regularizing objectives or adversarially robust training, rather than instance-level influence control alone.

Future Research: Several open questions follow: how to design defenses that directly flatten or randomize local loss minima; whether poisoning can be detected by advanced clustering or subspace analysis; and how similar leakage might manifest in RL, self-supervised, or continual learning paradigms. Mechanisms that guarantee both utility and geometric privacy remain an unsolved challenge.

Conclusion

This study formally defines and empirically validates a unified loss-surface-driven privacy attack, Loss Landscape Poisoning, capable of targeted extraction of unseen secrets from both LLMs and VLMs with only limited attacker capability. The authors show that differential privacy does not robustly mitigate this class of attacks unless utility is sacrificed. The results imply that privacy defenses for high-capacity foundation models must incorporate both output-based and geometric perspectives on information leakage, and that stronger regularization or architecture-level modifications may be required to ensure security in sensitive domains.
