
AdvPrompt-MIA: Adversarial Membership Inference

Updated 26 November 2025
  • AdvPrompt-MIA is a family of advanced membership inference attacks that use adversarial prompt design and statistical testing to detect if data was part of a model’s training set.
  • It applies semantics-preserving perturbations, builds 27-dimensional feature vectors that capture the stability of model outputs under those perturbations, and trains a deep classifier on them.
  • Empirical evaluations show high AUC scores in code and prompt settings, underscoring significant privacy risks and guiding effective mitigation strategies.

AdvPrompt-MIA refers to a family of advanced membership inference attacks that combine adversarial prompt design, behavioral measurement, and statistical decision procedures to infer whether data (especially prompts or code fragments) was present in the training data, system prompt, or context of a target machine learning model. Recent research formalizes and implements AdvPrompt-MIA for code LLMs, vision-language transformers, and clinical/multimodal learning, providing both technical methodology and empirical analysis for sensitive-data detection and privacy-risk auditing.

1. Formal Problem Definition and Attack Frameworks

AdvPrompt-MIA, as introduced in the code completion context, is designed to determine whether a given partial program and its completion $(x, y_{\text{true}})$ were part of the training data for a black-box LLM, denoted $M$. The attack is formulated as a binary decision problem: for each queried pair $(x, y)$, with $M(x) = \hat{y}$, the attacker aims to infer

$$G(x, y, \hat{y}) \in \{0, 1\}, \qquad G = 1 \Longleftrightarrow (x, y) \in D_{\text{in}}$$

where $D_{\text{in}}$ is the model’s (unknown) training set. The classifier $G$ is instantiated by extracting behavioral features from the outputs of $M$ subjected to a curated set of adversarial prompts, then training (typically) a deep classifier to distinguish members from non-members (Jiang et al., 19 Nov 2025).
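
For concreteness, $G$ can be viewed as a threshold over any membership score computed from the target model's observable behavior. The Python sketch below is a minimal illustration of this decision rule, not the paper's implementation; the names `membership_decision`, `score_fn`, and `tau` are hypothetical placeholders.

```python
from typing import Callable

def membership_decision(score_fn: Callable[[str, str, str], float],
                        x: str, y: str, y_hat: str, tau: float = 0.5) -> int:
    """Instantiate G(x, y, y_hat): return 1 iff (x, y) is predicted to lie in D_in.

    score_fn is any black-box membership score derived from the target model's
    behavior on x (for example, the output of the deep classifier described in
    Section 2); tau is the decision threshold. Both are assumptions of this
    sketch rather than artifacts of the cited paper.
    """
    return int(score_fn(x, y, y_hat) >= tau)
```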

In the prompt privacy setting, the central task is prompt-membership-inference, where an adversary with black-box access decides whether a proprietary system prompt $\bar{p}$ has been used by another LLM-based service. Here, the attack operationalizes a statistical test for distributional equality between output samples generated under the suspect prompt and those from a reference instance using $\bar{p}$ (Levin et al., 14 Feb 2025).

2. Methodologies: Adversarial Prompting and Statistical Decision

2.1 Adversarial Prompt Generation for Code Models

AdvPrompt-MIA leverages a suite of semantics-preserving program transformations to generate adversarial query variants. For a given code prefix $x$, the set of transformations $\mathcal{T} = \{ T_\mathrm{IDC}, T_\mathrm{IRV}, T_\mathrm{VR}, T_\mathrm{IDP}, T_\mathrm{IDL} \}$ (e.g., dead code insertion, identifier renaming, debug prints) systematically perturbs $x$ while guaranteeing preservation of runtime semantics. For each $x$, a collection of 11 perturbed variants $\mathcal{X}'(x)$ is instantiated according to a deterministic protocol (Jiang et al., 19 Nov 2025).
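
The sketch below illustrates how such rule-based, semantics-preserving perturbations could be composed into a deterministic variant set. The three transformation functions are simplified stand-ins for the paper's five transformation families, and the helper names and rewrite rules are illustrative assumptions, not the paper's exact protocol.

```python
import itertools
import re

# Each transformation must preserve the runtime semantics of the completed program.
def insert_dead_code(code: str) -> str:
    # Illustrative stand-in for dead-code insertion: append a branch that never executes.
    return code + "\nif False:\n    _unused = 0\n"

def rename_identifiers(code: str) -> str:
    # Illustrative stand-in for identifier renaming: rename a common local variable.
    return re.sub(r"\bresult\b", "res_0", code)

def insert_debug_print(code: str) -> str:
    # Illustrative stand-in for debug-print insertion: add a log statement that does
    # not alter return values.
    return code + "\nprint('debug checkpoint')\n"

TRANSFORMS = [insert_dead_code, rename_identifiers, insert_debug_print]

def build_variants(prefix: str, n_variants: int = 11) -> list[str]:
    """Deterministically compose transformations into a perturbed prompt set X'(x)."""
    variants = []
    # Enumerate non-empty subsets of transformations in a fixed order.
    for r in range(1, len(TRANSFORMS) + 1):
        for combo in itertools.combinations(TRANSFORMS, r):
            perturbed = prefix
            for transform in combo:
                perturbed = transform(perturbed)
            variants.append(perturbed)
            if len(variants) == n_variants:
                return variants
    return variants
```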

2.2 Feature Construction and Deep Classification

For each $x$, the model's responses to the original and perturbed prompts are scored according to CodeBERT-based similarity and perplexity shifts:

  • Cosine similarity $\mathrm{sim}(y, \hat{y}_i)$
  • Normalized perplexity change $\mathrm{PPL}'(\hat{y}_i)$

These statistics (means, variances, and per-perturbation values) are concatenated into a 27-dimensional feature vector, which is used to train an MLP classifier to predict membership (Jiang et al., 19 Nov 2025).
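
A minimal sketch of this feature-and-classifier stage follows, assuming precomputed CodeBERT-style embeddings and perplexities for the ground-truth completion and for each perturbed-prompt completion. The exact composition of the paper's 27-dimensional vector may differ, and the scikit-learn MLP is an assumption of this sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def feature_vector(emb_y: np.ndarray,
                   emb_variants: list[np.ndarray],
                   ppl_orig: float,
                   ppl_variants: list[float]) -> np.ndarray:
    """Concatenate per-variant similarities and perplexity shifts with their
    summary statistics, loosely following the paper's feature design."""
    sims = np.array([cosine(emb_y, e) for e in emb_variants])
    # Normalized perplexity change of each perturbed completion vs. the original.
    ppl_shift = np.array([(p - ppl_orig) / ppl_orig for p in ppl_variants])
    summary = np.array([sims.mean(), sims.var(), ppl_shift.mean(), ppl_shift.var()])
    return np.concatenate([sims, ppl_shift, summary])

def train_membership_classifier(X: np.ndarray, labels: np.ndarray) -> MLPClassifier:
    """Train the membership classifier G on features from known members/non-members."""
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(X, labels)
    return clf
```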

2.3 Statistical Testing for Prompt Membership

For system prompt inference, outputs from a reference instance ($\bar{f}_{\bar{p}}$) and a suspected model ($f_p$) are encoded (e.g., via Sentence-BERT) and summarized by group means $\mu_1$, $\mu_2$. A cosine similarity $S_{\text{obs}} = \cos(\mu_1, \mu_2)$ is computed, and a permutation test across all pooled samples estimates the $p$-value for the null hypothesis that $p \neq \bar{p}$, with rejection at level $\alpha$ (typically $0.05$). Statistical power is further validated against various embedding, token-probability, and divergence metrics (Levin et al., 14 Feb 2025).
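
The sketch below shows the permutation machinery under the assumption that the outputs have already been embedded (e.g., with a Sentence-BERT model). For simplicity it is written against the conventional null of distributional equality; mapping the resulting $p$-value onto the membership decision follows the paper's own hypothesis-testing procedure, which is not reproduced here.

```python
import numpy as np

def _cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def prompt_permutation_test(emb_ref: np.ndarray, emb_sus: np.ndarray,
                            n_perm: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Permutation test on S_obs = cos(mean(emb_ref), mean(emb_sus)).

    emb_ref: (n1, d) embeddings of outputs from the reference instance using p-bar.
    emb_sus: (n2, d) embeddings of outputs from the suspect service.
    Returns (S_obs, p), where a small p means the observed group-mean similarity is
    unusually low under random regroupings of the pooled samples, i.e., evidence
    that the suspect outputs are distributionally different from the reference.
    """
    rng = np.random.default_rng(seed)
    s_obs = _cos(emb_ref.mean(axis=0), emb_sus.mean(axis=0))
    pooled = np.vstack([emb_ref, emb_sus])
    n1 = len(emb_ref)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        a, b = pooled[idx[:n1]], pooled[idx[n1:]]
        if _cos(a.mean(axis=0), b.mean(axis=0)) <= s_obs:
            count += 1
    return s_obs, (count + 1) / (n_perm + 1)
```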

3. Empirical Evaluation and Results

3.1 Code Completion Models

On Code Llama 7B and other 7B-class models, AdvPrompt-MIA achieves:

  • HumanEval: AUC = $0.97$ (TPR = $0.85$, FPR = $0.05$)
  • APPS: AUC = $0.95$ (TPR = $0.90$, FPR = $0.14$)

This constitutes up to a $+102\%$ AUC gain over prior baselines, which rely on shadow models or heuristic metrics. The attack generalizes across model families and datasets, with transfer AUCs up to $0.97$ (Jiang et al., 19 Nov 2025).

3.2 Prompt Membership, LLM Families

Testing against Llama2-13B, Llama3-70B, Mistral-7B, Mixtral-8×7B, Claude 3 Haiku, and GPT-3.5, prompt membership inference by AdvPrompt-MIA exhibits:

  • False positive rate $\sim 0\%$ at $\alpha = 0.05$
  • False negative rate $\sim 5\%$
  • ROC-AUC $> 0.99$ in most settings
  • For nearly identical prompt rephrasings, a sufficient sample size ($\sim 300$ responses) is required to eliminate false positives (Levin et al., 14 Feb 2025)

3.3 Comparative Results

Dataset   | Best Baseline AUC | AdvPrompt-MIA AUC | Relative Gain
HumanEval | 0.58 (GOTCHA)     | 0.97              | +67%
APPS      | 0.58 (PPL-Rank)   | 0.95              | +64%

4. Theoretical and Practical Implications

AdvPrompt-MIA demonstrates that membership—either of code samples in training data or of system prompts in black-box model deployments—can be inferred with high fidelity using only query-and-observe APIs, with no direct access to model weights or internals. Members exhibit “behavioral stability” under adversarial querying: their completions or model behaviors are more similar and less perturbed by semantics-preserving transformations compared to non-members.

This capability reveals new privacy risks for proprietary code, sensitive prompts, and potentially any data artifact present in large-scale model training. For organizations relying on prompt-engineered deployments, even minute variations in system prompts are shown to leave persistent, detectable statistical signatures in model outputs (Levin et al., 14 Feb 2025), while in code completion, member status acts as a stable attractor in the behavioral feature space (Jiang et al., 19 Nov 2025).

5. Limitations, Robustness, and Defenses

5.1 Limitations

  • Validation of the code-membership attack is currently limited to models in the 7B-parameter class and to models without strong defenses (e.g., gradient obfuscation, prompt randomization).
  • Adversarial prompt perturbations are currently rule-based and not learned adaptively; coverage of all possible domains may require further generalization.
  • In code membership settings, some knowledge ($\sim 20\%$) of the training set is useful for classifier calibration, but meaningful accuracy persists with as little as $5\%$ known (Jiang et al., 19 Nov 2025).

5.2 Defenses

Suggested mitigations include:

  • Prompt obfuscation (random preambles, postfixes)
  • Output-noise injection (paraphrasing or stochasticity)
  • Rotating or dynamic prompts
  • Rate limiting and detection of abnormal query patterns (Levin et al., 14 Feb 2025)
  • Dedicated defense frameworks such as PromptKeeper, which detects prompt leakage via hypothesis testing and scrubs or replaces responses when leakage is detected, maintaining indistinguishability between benign and defended outputs (Jiang et al., 18 Dec 2024).

Empirical evaluation of these defenses indicates that prompt leakage can be reduced to near-chance levels while only minimally impacting response quality or system utility (Jiang et al., 18 Dec 2024).
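
As a concrete illustration of the first two mitigations listed above (prompt obfuscation and output-side stochasticity), the sketch below wraps a generic `generate()` callable. The preamble texts, the callable signature, and the temperature jitter are assumptions of this sketch rather than mechanisms taken from the cited papers; PromptKeeper's hypothesis-testing defense is more involved and is not reproduced here.

```python
import random
from typing import Callable

PREAMBLES = [
    "You may answer in your own words.",
    "Be concise unless asked otherwise.",
    "Answer helpfully and accurately.",
]

def obfuscated_prompt(system_prompt: str, rng: random.Random) -> str:
    """Prompt obfuscation: wrap the proprietary prompt with randomly chosen
    preamble/postfix text so repeated queries never see an identical prompt."""
    pre = rng.choice(PREAMBLES)
    post = rng.choice(PREAMBLES)
    return f"{pre}\n{system_prompt}\n{post}"

def defended_generate(generate: Callable[[str, str, float], str],
                      system_prompt: str,
                      user_msg: str,
                      base_temperature: float = 0.7,
                      seed: int | None = None) -> str:
    """Combine prompt obfuscation with extra output stochasticity
    (here, a small per-request temperature jitter)."""
    rng = random.Random(seed)
    prompt = obfuscated_prompt(system_prompt, rng)
    temperature = base_temperature + rng.uniform(0.0, 0.3)
    return generate(prompt, user_msg, temperature)
```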

6. Extensions Beyond Natural Language

AdvPrompt-MIA-style attacks and defenses are not confined to natural-language prompts. Prompt-level membership inference is being extended to other modalities and architectures, including the vision-language and clinical/multimodal settings noted above.

A plausible implication is that adversarial prompt or query strategies, combined with behavioral measurement and statistical testing, can serve as a universal substrate for membership inference across diverse data modalities and model architectures.

7. Conclusions and Research Impact

AdvPrompt-MIA represents a statistically rigorous, empirically validated method for uncovering the latent presence of prompts, code, or data artifacts in advanced machine learning models. By leveraging adversarial prompting, embedding-based or behavioral features, and explicit hypothesis testing, these methods establish new boundaries for model transparency, privacy auditing, and defense design in the era of large-scale generative models (Jiang et al., 19 Nov 2025, Levin et al., 14 Feb 2025, Jiang et al., 18 Dec 2024).

AdvPrompt-MIA’s demonstrated utility across code and language domains, its resilience to model and dataset variation, and its sharp detection of privacy signals foreshadow broader integration of adversarial membership testing as a crucial component of future model evaluation frameworks.
