
AdvPrompt-MIA: Adversarial Membership Inference

Updated 26 November 2025
  • AdvPrompt-MIA is a family of advanced membership inference attacks that use adversarial prompt design and statistical testing to detect if data was part of a model’s training set.
  • It applies semantics-preserving perturbations, builds 27-dimensional feature vectors that capture the stability of model outputs under those perturbations, and trains a deep classifier on them.
  • Empirical evaluations show high AUC scores in code and prompt settings, underscoring significant privacy risks and guiding effective mitigation strategies.

AdvPrompt-MIA refers to a family of advanced membership inference attacks that combine adversarial prompt design, behavioral measurement, and statistical decision procedures to infer whether data (especially prompts or code fragments) was present in the training data, system prompt, or context of a target machine learning model. Recent research formalizes and implements AdvPrompt-MIA for code LLMs, vision-language transformers, and clinical/multimodal learning, providing both technical methodology and empirical analysis for sensitive-data detection and privacy-risk auditing.

1. Formal Problem Definition and Attack Frameworks

AdvPrompt-MIA, as introduced in the code completion context, is designed to determine whether a given partial program and its completion $(x, y_{\text{true}})$ were part of the training data for a black-box LLM, denoted $M$. The attack is formulated as a binary decision problem: for each queried pair $(x, y)$, with $M(x) = \hat{y}$, the attacker aims to infer

$$G(x, y, \hat{y}) \in \{0, 1\}, \qquad G = 1 \Longleftrightarrow (x, y) \in D_{\text{in}}$$

where $D_{\text{in}}$ is the model’s (unknown) training set. The classifier $G$ is instantiated by extracting behavioral features from the outputs of $M$ subjected to a curated set of adversarial prompts, then training (typically) a deep classifier to distinguish members from non-members (Jiang et al., 19 Nov 2025).
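
For concreteness, $G$ can be viewed as a threshold over any membership score computed from the target model's observable behavior. The Python sketch below is a minimal illustration of this decision rule, not the paper's implementation; the names `membership_decision`, `score_fn`, and `tau` are hypothetical placeholders.

```python
from typing import Callable

def membership_decision(score_fn: Callable[[str, str, str], float],
                        x: str, y: str, y_hat: str, tau: float = 0.5) -> int:
    """Instantiate G(x, y, y_hat): return 1 iff (x, y) is predicted to lie in D_in.

    score_fn is any black-box membership score derived from the target model's
    behavior on x (for example, the output of the deep classifier described in
    Section 2); tau is the decision threshold. Both are assumptions of this
    sketch rather than artifacts of the cited paper.
    """
    return int(score_fn(x, y, y_hat) >= tau)
```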

In the prompt privacy setting, the central task is prompt-membership-inference, where an adversary with black-box access decides whether a proprietary system prompt $\bar{p}$ has been used by another LLM-based service. Here, the attack operationalizes a statistical test for distributional equality between output samples generated under the suspect prompt and those from a reference instance using $\bar{p}$ (Levin et al., 14 Feb 2025).

2. Methodologies: Adversarial Prompting and Statistical Decision

2.1 Adversarial Prompt Generation for Code Models

AdvPrompt-MIA leverages a suite of semantics-preserving program transformations to generate adversarial query variants. For a given code prefix $x$, the set of transformations $\mathcal{T} = \{ T_\mathrm{IDC}, T_\mathrm{IRV}, T_\mathrm{VR}, T_\mathrm{IDP}, T_\mathrm{IDL} \}$ (e.g., dead code insertion, identifier renaming, debug prints) systematically perturbs $x$ while guaranteeing preservation of runtime semantics. For each $x$, a collection of 11 perturbed variants $\mathcal{X}'(x)$ is instantiated according to a deterministic protocol (Jiang et al., 19 Nov 2025).
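
The sketch below illustrates how such rule-based, semantics-preserving perturbations could be composed into a deterministic variant set. The three transformation functions are simplified stand-ins for the paper's five transformation families, and the helper names and rewrite rules are illustrative assumptions, not the paper's exact protocol.

```python
import itertools
import re

# Each transformation must preserve the runtime semantics of the completed program.
def insert_dead_code(code: str) -> str:
    # Illustrative stand-in for dead-code insertion: append a branch that never executes.
    return code + "\nif False:\n    _unused = 0\n"

def rename_identifiers(code: str) -> str:
    # Illustrative stand-in for identifier renaming: rename a common local variable.
    return re.sub(r"\bresult\b", "res_0", code)

def insert_debug_print(code: str) -> str:
    # Illustrative stand-in for debug-print insertion: add a log statement that does
    # not alter return values.
    return code + "\nprint('debug checkpoint')\n"

TRANSFORMS = [insert_dead_code, rename_identifiers, insert_debug_print]

def build_variants(prefix: str, n_variants: int = 11) -> list[str]:
    """Deterministically compose transformations into a perturbed prompt set X'(x)."""
    variants = []
    # Enumerate non-empty subsets of transformations in a fixed order.
    for r in range(1, len(TRANSFORMS) + 1):
        for combo in itertools.combinations(TRANSFORMS, r):
            perturbed = prefix
            for transform in combo:
                perturbed = transform(perturbed)
            variants.append(perturbed)
            if len(variants) == n_variants:
                return variants
    return variants
```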

2.2 Feature Construction and Deep Classification

For each $x$, the model's responses to the original and perturbed prompts are scored according to CodeBERT-based similarity and perplexity shifts:

  • Cosine similarity $\mathrm{sim}(y, \hat{y}_i)$
  • Normalized perplexity change $\mathrm{PPL}'(\hat{y}_i)$

These statistics (means, variances, and per-perturbation values) are concatenated into a 27-dimensional feature vector, which is used to train an MLP classifier to predict membership (Jiang et al., 19 Nov 2025).
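
A minimal sketch of this feature-and-classifier stage follows, assuming precomputed CodeBERT-style embeddings and perplexities for the ground-truth completion and for each perturbed-prompt completion. The exact composition of the paper's 27-dimensional vector may differ, and the scikit-learn MLP is an assumption of this sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def feature_vector(emb_y: np.ndarray,
                   emb_variants: list[np.ndarray],
                   ppl_orig: float,
                   ppl_variants: list[float]) -> np.ndarray:
    """Concatenate per-variant similarities and perplexity shifts with their
    summary statistics, loosely following the paper's feature design."""
    sims = np.array([cosine(emb_y, e) for e in emb_variants])
    # Normalized perplexity change of each perturbed completion vs. the original.
    ppl_shift = np.array([(p - ppl_orig) / ppl_orig for p in ppl_variants])
    summary = np.array([sims.mean(), sims.var(), ppl_shift.mean(), ppl_shift.var()])
    return np.concatenate([sims, ppl_shift, summary])

def train_membership_classifier(X: np.ndarray, labels: np.ndarray) -> MLPClassifier:
    """Train the membership classifier G on features from known members/non-members."""
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(X, labels)
    return clf
```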

2.3 Statistical Testing for Prompt Membership

For system prompt inference, outputs from a reference instance ($\bar{f}_{\bar{p}}$) and a suspected model ($f_p$) are encoded (e.g., via Sentence-BERT) and summarized by group means $\mu_1$, $\mu_2$. A cosine similarity $S_{\text{obs}} = \cos(\mu_1, \mu_2)$ is computed, and a permutation test across all pooled samples estimates the $p$-value for the null hypothesis that $p \neq \bar{p}$, with rejection at level $\alpha$ (typically $0.05$). Statistical power is further validated against various embedding, token-probability, and divergence metrics (Levin et al., 14 Feb 2025).
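
The sketch below shows the permutation machinery under the assumption that the outputs have already been embedded (e.g., with a Sentence-BERT model). For simplicity it is written against the conventional null of distributional equality; mapping the resulting $p$-value onto the membership decision follows the paper's own hypothesis-testing procedure, which is not reproduced here.

```python
import numpy as np

def _cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def prompt_permutation_test(emb_ref: np.ndarray, emb_sus: np.ndarray,
                            n_perm: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Permutation test on S_obs = cos(mean(emb_ref), mean(emb_sus)).

    emb_ref: (n1, d) embeddings of outputs from the reference instance using p-bar.
    emb_sus: (n2, d) embeddings of outputs from the suspect service.
    Returns (S_obs, p), where a small p means the observed group-mean similarity is
    unusually low under random regroupings of the pooled samples, i.e., evidence
    that the suspect outputs are distributionally different from the reference.
    """
    rng = np.random.default_rng(seed)
    s_obs = _cos(emb_ref.mean(axis=0), emb_sus.mean(axis=0))
    pooled = np.vstack([emb_ref, emb_sus])
    n1 = len(emb_ref)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        a, b = pooled[idx[:n1]], pooled[idx[n1:]]
        if _cos(a.mean(axis=0), b.mean(axis=0)) <= s_obs:
            count += 1
    return s_obs, (count + 1) / (n_perm + 1)
```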

3. Empirical Evaluation and Results

3.1 Code Completion Models

On Code Llama 7B and other 7B-class models, AdvPrompt-MIA achieves:

  • HumanEval: AUC = $0.97$ (TPR = $0.85$, FPR = $0.05$)
  • APPS: AUC = $0.95$ (TPR = $0.90$, FPR = $0.14$)

This constitutes up to a $+102\%$ AUC gain over prior baselines, which rely on shadow models or heuristic metrics. The attack generalizes across model families and datasets, with transfer AUCs up to $0.97$ (Jiang et al., 19 Nov 2025).

3.2 Prompt Membership, LLM Families

Testing against Llama2-13B, Llama3-70B, Mistral-7B, Mixtral-8×7B, Claude 3 Haiku, and GPT-3.5, prompt membership inference by AdvPrompt-MIA exhibits:

  • False positive rate $\sim 0\%$ at $\alpha = 0.05$
  • False negative rate $\sim 5\%$
  • ROC-AUC $> 0.99$ in most settings
  • For nearly identical prompt rephrasings, a sufficient sample size ($\sim 300$ responses) is required to eliminate false positives (Levin et al., 14 Feb 2025)

3.3 Comparative Results

Dataset   | Best Baseline AUC | AdvPrompt-MIA AUC | Relative Gain
HumanEval | 0.58 (GOTCHA)     | 0.97              | +67%
APPS      | 0.58 (PPL-Rank)   | 0.95              | +64%

4. Theoretical and Practical Implications

AdvPrompt-MIA demonstrates that membership—either of code samples in training data or of system prompts in black-box model deployments—can be inferred with high fidelity using only query-and-observe APIs, with no direct access to model weights or internals. Members exhibit “behavioral stability” under adversarial querying: their completions or model behaviors are more similar and less perturbed by semantics-preserving transformations compared to non-members.

This capability reveals new privacy risks for proprietary code, sensitive prompts, and potentially any data artifact present in large-scale model training. For organizations relying on prompt-engineered deployments, even minute variations in system prompts are shown to leave persistent, detectable statistical signatures in model outputs (Levin et al., 14 Feb 2025), while in code completion, member status acts as a stable attractor in the behavioral feature space (Jiang et al., 19 Nov 2025).

5. Limitations, Robustness, and Defenses

5.1 Limitations

  • Validation of the code-membership attack is currently limited to models in the 7B-parameter class and to models without strong defenses (e.g., gradient obfuscation, prompt randomization).
  • Adversarial prompt perturbations are currently rule-based and not learned adaptively; coverage of all possible domains may require further generalization.
  • In code membership settings, some knowledge ($\sim 20\%$) of the training set is useful for classifier calibration, but meaningful accuracy persists with as little as $5\%$ known (Jiang et al., 19 Nov 2025).

5.2 Defenses

Suggested mitigations include:

  • Prompt obfuscation (random preambles, postfixes)
  • Output-noise injection (paraphrasing or stochasticity)
  • Rotating or dynamic prompts
  • Rate limiting and detection of abnormal query patterns (Levin et al., 14 Feb 2025)
  • Dedicated defense frameworks such as PromptKeeper, which detects prompt leakage via hypothesis testing and scrubs or replaces responses when leakage is detected, maintaining indistinguishability between benign and defended outputs (Jiang et al., 18 Dec 2024).

Empirical evaluation of these defenses indicates that prompt leakage can be reduced to near-chance levels while only minimally impacting response quality or system utility (Jiang et al., 18 Dec 2024).
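
As a concrete illustration of the first two mitigations listed above (prompt obfuscation and output-side stochasticity), the sketch below wraps a generic `generate()` callable. The preamble texts, the callable signature, and the temperature jitter are assumptions of this sketch rather than mechanisms taken from the cited papers; PromptKeeper's hypothesis-testing defense is more involved and is not reproduced here.

```python
import random
from typing import Callable

PREAMBLES = [
    "You may answer in your own words.",
    "Be concise unless asked otherwise.",
    "Answer helpfully and accurately.",
]

def obfuscated_prompt(system_prompt: str, rng: random.Random) -> str:
    """Prompt obfuscation: wrap the proprietary prompt with randomly chosen
    preamble/postfix text so repeated queries never see an identical prompt."""
    pre = rng.choice(PREAMBLES)
    post = rng.choice(PREAMBLES)
    return f"{pre}\n{system_prompt}\n{post}"

def defended_generate(generate: Callable[[str, str, float], str],
                      system_prompt: str,
                      user_msg: str,
                      base_temperature: float = 0.7,
                      seed: int | None = None) -> str:
    """Combine prompt obfuscation with extra output stochasticity
    (here, a small per-request temperature jitter)."""
    rng = random.Random(seed)
    prompt = obfuscated_prompt(system_prompt, rng)
    temperature = base_temperature + rng.uniform(0.0, 0.3)
    return generate(prompt, user_msg, temperature)
```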

6. Extensions Beyond Natural Language

AdvPrompt-MIA-style attacks and defenses are not confined to natural-language prompts. Prompt-level membership inference is being extended to other modalities and architectures, including the vision-language and clinical/multimodal settings noted above.

A plausible implication is that adversarial prompt or query strategies, combined with behavioral measurement and statistical testing, can serve as a universal substrate for membership inference across diverse data modalities and model architectures.

7. Conclusions and Research Impact

AdvPrompt-MIA represents a statistically rigorous, empirically validated method for uncovering the latent presence of prompts, code, or data artifacts in advanced machine learning models. By leveraging adversarial prompting, embedding-based or behavioral features, and explicit hypothesis testing, these methods establish new boundaries for model transparency, privacy auditing, and defense design in the era of large-scale generative models (Jiang et al., 19 Nov 2025, Levin et al., 14 Feb 2025, Jiang et al., 18 Dec 2024).

AdvPrompt-MIA’s demonstrated utility across code and language domains, its resilience to model and dataset variation, and its sharp detection of privacy signals foreshadow broader integration of adversarial membership testing as a crucial component of future model evaluation frameworks.
