
AttenMIA: Attention-based Membership Inference

Updated 2 February 2026
  • The paper introduces AttenMIA, a framework that utilizes perturbation-induced attention divergences and layer-wise transitional dynamics to identify training data presence.
  • AttenMIA extracts key features using perturbation-based attention shifts and statistical measures such as KL divergence, Pearson correlation, and Frobenius norm across transformer layers.
  • The framework achieves near-perfect discrimination across LLMs as demonstrated by high ROC AUC and low false positive rates, indicating significant privacy leakage risks.

AttenMIA is a membership inference attack framework targeting LLMs by leveraging internal self-attention mechanisms within the transformer architecture. By exploiting both perturbation-induced attention divergences and layer-wise transitional dynamics, AttenMIA identifies whether a specific input sequence was present in an LLM’s pretraining set. This framework operates in a white-box regime—requiring full access to model parameters and attention matrices—but does not require shadow models or explicit reference to training data. AttenMIA achieves state-of-the-art accuracy and low false positive rates, revealing that attention patterns encode fine-grained signals of memorization in LLMs (Zaree et al., 26 Jan 2026).

1. Problem Definition and Motivations

AttenMIA addresses the membership inference problem on a transformer-based LLM $f_\theta$ with the following setup:

Given a sequence $x = (x_1, \ldots, x_T)$ and full (white-box) access to $f_\theta$'s internal states—including attention weights—can an adversary infer the binary label $m(x)$:

$$m(x) = \begin{cases} 1 & \text{if } x \text{ appears in the training set} \\ 0 & \text{otherwise} \end{cases}$$

The threat model assumes adversaries can extract every self-attention matrix but lack shadow models or auxiliary training-data references. Prior MIAs primarily rely on output confidence or embedding-based signals, which have limited robustness. AttenMIA instead exploits distinctive properties of attention—in particular, that training-set members typically induce sharper, more layer-consistent, and more stable attention maps, whereas non-members produce uniform or noisy patterns. The framework systematically quantifies these distinctions for high-confidence membership inference.

2. Attention Feature Extraction: Perturbation and Transitional Statistics

AttenMIA formalizes two main classes of attention-derived features:

Perturbation-based Features

A family $\mathcal{P}$ of perturbation functions is defined (e.g., token dropping, token replacement, non-member prefix insertion). For each $p \in \mathcal{P}$:

  • The input is perturbed as $x' = p(x)$.
  • Attention matrices $\{A^{(\ell,h)}\}$ for each layer $\ell$ and head $h$ are computed before and after perturbation.

Divergence is measured by the mean per-row Kullback–Leibler (KL) divergence:

$$\Delta_{\mathrm{KL}}^{(\ell,h)}(x,p) = \frac{1}{T} \sum_{i=1}^T \mathrm{KL}\left(A^{(\ell,h)}_{i,:} \,\Big\Vert\, A'^{(\ell,h)}_{i,:}\right)$$

where $A^{(\ell,h)}$ and $A'^{(\ell,h)}$ are the attention maps on $x$ and $x'$, respectively.
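This per-row KL feature can be sketched as follows; the helper names and the toy matrices are illustrative stand-ins, not the paper's implementation, and assume row-stochastic $T \times T$ attention maps:

```python
import numpy as np

def row_kl(P, Q, eps=1e-12):
    """Mean per-row KL divergence between two row-stochastic T x T matrices."""
    P = np.clip(P, eps, None)
    Q = np.clip(Q, eps, None)
    return float(np.mean(np.sum(P * np.log(P / Q), axis=1)))

def perturbation_kl_feature(A, A_pert):
    """Delta_KL^{(l,h)}(x, p) for one layer/head: attention on x vs. on x' = p(x)."""
    return row_kl(A, A_pert)

# Toy example: a sharp (member-like) map vs. a near-uniform map after perturbation.
T = 4
A = np.eye(T) * 0.9 + 0.1 / T          # peaked rows, each summing to 1
A_pert = np.full((T, T), 1.0 / T)      # uniform rows after perturbation
delta = perturbation_kl_feature(A, A_pert)   # large divergence for the peaked map
```

A map that barely moves under perturbation (member-like stability) yields a small `delta`, while a map that collapses toward uniform yields a large one.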

Transitional Features

Intrinsic layer-to-layer attention dynamics are encoded using:

  • Pearson correlation: $\mathrm{Corr}^{(\ell,h)} = \mathrm{corr}\bigl(\mathrm{vec}(A^{(\ell,h)}),\, \mathrm{vec}(A^{(\ell+1,h)})\bigr)$
  • Normalized Frobenius distance: $\Delta_F^{(\ell,h)} = \lVert A^{(\ell+1,h)} - A^{(\ell,h)} \rVert_F \,/\, T^2$
  • Row-wise KL: $\Delta_{\mathrm{KL}}^{(\ell,h)} = \frac{1}{T} \sum_{i=1}^T \mathrm{KL}\bigl(A^{(\ell,h)}_{i,:} \,\Vert\, A^{(\ell+1,h)}_{i,:}\bigr)$
  • Barycenter drift and variance: using the per-token barycenter $c_i^{(\ell,h)} = \sum_{j=1}^T j\, A^{(\ell,h)}_{i,j}$, the drift $d_i^{(\ell,h)} = \lvert c_i^{(\ell+1,h)} - c_i^{(\ell,h)} \rvert$ is summarized by its mean and variance across the sequence

These transitional statistics capture both the stability and the evolution of attention structure across the network’s depth.
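The four transitional statistics can be computed in a few lines; this is a minimal sketch assuming row-stochastic $T \times T$ maps from consecutive layers, with function and key names chosen here for illustration:

```python
import numpy as np

def transitional_stats(A_l, A_l1, eps=1e-12):
    """Layer-to-layer transitional features for one head.

    A_l, A_l1: row-stochastic T x T attention maps at layers l and l+1.
    """
    T = A_l.shape[0]
    corr = np.corrcoef(A_l.ravel(), A_l1.ravel())[0, 1]   # Pearson correlation
    frob = np.linalg.norm(A_l1 - A_l) / T**2              # normalized Frobenius
    P, Q = np.clip(A_l, eps, None), np.clip(A_l1, eps, None)
    kl = float(np.mean(np.sum(P * np.log(P / Q), axis=1)))  # row-wise KL
    j = np.arange(1, T + 1)
    c_l, c_l1 = A_l @ j, A_l1 @ j                         # per-token barycenters
    d = np.abs(c_l1 - c_l)                                # barycenter drift
    return {"corr": corr, "frob": frob, "kl": kl,
            "drift_mean": float(d.mean()), "drift_var": float(d.var())}

# Identical consecutive layers represent maximal stability (member-like).
rng = np.random.default_rng(1)
A = rng.random((5, 5))
A /= A.sum(axis=1, keepdims=True)
same = transitional_stats(A, A)
```

For identical consecutive maps, correlation is 1 and the three distance statistics vanish; noisier, less consistent dynamics push all of them away from those extremes.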

3. Feature Aggregation and Classifier Design

All per-head and per-layer attention features are concatenated into a feature vector $\mathbf{v}(x) \in \mathbb{R}^D$:

$$D = |\mathcal{P}| \times L \times H + (\text{transitional stats}) \times (L-1) \times H$$

where $L$ is the number of transformer layers and $H$ the number of attention heads per layer.

A lightweight multi-layer perceptron (MLP) is trained to predict membership,

$$f: \mathbb{R}^D \rightarrow [0,1], \quad \hat m = f(\mathbf{v}(x)),$$

using the binary cross-entropy loss over labeled (member/non-member) instances:

$$\mathcal{L} = -\bigl[\, y \log f(\mathbf{v}) + (1-y)\log\bigl(1 - f(\mathbf{v})\bigr) \bigr]$$

This design enables flexible and scalable scoring across diverse architectures and input lengths.
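A forward pass and loss of such a classifier can be sketched in plain NumPy; the hidden width, ReLU activation, and single hidden layer are assumptions for illustration, not the paper's reported architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(v, W1, b1, w2, b2):
    """Lightweight MLP f: R^D -> [0,1] scoring one feature vector v."""
    h = np.maximum(0.0, v @ W1 + b1)      # ReLU hidden layer
    z = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid membership score

def bce_loss(y, m_hat, eps=1e-12):
    """Binary cross-entropy over member (y=1) / non-member (y=0) labels."""
    m_hat = np.clip(m_hat, eps, 1 - eps)
    return float(-np.mean(y * np.log(m_hat) + (1 - y) * np.log(1 - m_hat)))

# Toy feature vectors (D=8) standing in for concatenated attention features.
D, H = 8, 16
W1, b1 = rng.normal(scale=0.1, size=(D, H)), np.zeros(H)
w2, b2 = rng.normal(scale=0.1, size=H), 0.0
V = rng.normal(size=(4, D))
y = np.array([1, 0, 1, 0])
scores = np.array([mlp_forward(v, W1, b1, w2, b2) for v in V])
loss = bce_loss(y, scores)
```

Training would minimize `loss` by gradient descent over labeled instances; only the scoring function matters downstream.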

4. Experimental Benchmarks and Results

AttenMIA was evaluated on open-weight LLMs (LLaMA-2, Pythia, OPT, GPT-NeoX) using several benchmarks:

  • WikiMIA-32/64/128: Wikipedia-derived sequences of 32, 64, or 128 tokens
  • MIMIR subsets: domain-diverse data (GitHub, Pile CC, PubMed, Wikipedia, arXiv, DM Math, HackerNews)

The main performance metrics were:

  • ROC AUC (area under the receiver operating characteristic curve), computed as

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR})\, d(\mathrm{FPR})$$

  • TPR@1%FPR (true positive rate at a fixed 1% false positive rate),

$$\mathrm{TPR@1\%FPR} = \Pr_{y=1}[f(\mathbf{v}) > \tau], \quad \text{with } \Pr_{y=0}[f(\mathbf{v}) > \tau] = 1\%$$

Key results include:

| Model / Dataset | ROC AUC | TPR@1%FPR |
| --- | --- | --- |
| LLaMA-2-13B, WikiMIA-32 | 0.996 | 87.9% |
| Pythia-6.9B, GitHub subset | ≈1.00 | ≈95.4% |
| MIMIR/Pythia (avg) | 0.89–0.99 | 42.3–55.4% |

This demonstrates near-perfect discrimination between members and non-members, particularly under low false positive constraints.

5. Layer and Head-level Memorization Analysis

AttenMIA enables granular analysis of where memorization occurs within transformer architectures:

  • KL-to-uniform: deeper layers and heads exhibit systematically higher

$$\kappa^{(\ell,h)} = \frac{1}{T} \sum_{i=1}^T \mathrm{KL}\bigl(A^{(\ell,h)}_{i,:} \,\Vert\, U_T\bigr)$$

for training members, reflecting sharper, more peaked attention maps.

  • Perturbation sensitivity: the distributions of $\Delta_{\mathrm{KL}}^{(\ell,h)}(x,p)$ cleanly separate members from non-members (kernel density analysis).
  • Transitional feature distinctions (correlation, Frobenius norm, row-wise KL, barycenter drift) all offer statistically significant separation using Hellinger and KL divergence measures.
  • Feature aggregation across layers: Using attention features from lower, middle, and upper layers in aggregate steadily improves ROC AUC; maximal performance is achieved by leveraging all layers.
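The KL-to-uniform statistic $\kappa^{(\ell,h)}$ from the list above reduces to a short computation; the toy matrices here are stand-ins chosen to illustrate the member/non-member contrast:

```python
import numpy as np

def kl_to_uniform(A, eps=1e-12):
    """kappa^{(l,h)}: mean row-wise KL from attention rows to uniform U_T.

    For a row p, KL(p || U_T) = sum_j p_j log(p_j * T).
    """
    T = A.shape[0]
    P = np.clip(A, eps, None)
    return float(np.mean(np.sum(P * np.log(P * T), axis=1)))

# Peaked (member-like) rows score high; near-uniform rows score near zero.
T = 8
peaked = np.full((T, T), 0.02)
np.fill_diagonal(peaked, 1 - 0.02 * (T - 1))   # rows still sum to 1
uniformish = np.full((T, T), 1.0 / T)
```

Plotting $\kappa^{(\ell,h)}$ by layer index is how the depth trend described above becomes visible: member inputs push deep-layer rows toward the peaked regime.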

This suggests that membership signals are widely distributed yet amplified at greater network depth, and that attention head specialization contributes to memorization phenomena.

6. Integration with Data Extraction Pipelines

AttenMIA’s membership scoring substantially enhances data extraction attacks. In the “generate-and-rank” pipeline:

  1. Random 5–10 token prefixes are used to prompt a generator (e.g., GPT-2).
  2. Model continuations of 256 tokens are produced.
  3. Each candidate $y$'s attention feature vector $\mathbf{v}(y)$ is extracted and scored as $s = f(\mathbf{v}(y))$.
  4. Candidates are ranked by score $s$.
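The four steps above reduce to a scoring-and-sorting loop; `toy_features` and `toy_scorer` below are hypothetical stand-ins for the attention-feature extractor and the trained membership MLP (no real model is called):

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_candidates(candidates, feature_fn, scorer):
    """Generate-and-rank: score each continuation y by s = f(v(y)), sort descending."""
    scored = [(scorer(feature_fn(y)), y) for y in candidates]
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Illustrative stand-ins: a deterministic pseudo-feature per string and a
# fixed sigmoid readout in place of the trained MLP.
def toy_features(y):
    r = np.random.default_rng(abs(hash(y)) % (2**32))
    return r.normal(size=4)

w = rng.normal(size=4)
def toy_scorer(v):
    return float(1.0 / (1.0 + np.exp(-v @ w)))

ranked = rank_candidates(["cand_a", "cand_b", "cand_c"], toy_features, toy_scorer)
```

In the real pipeline the high-ranked continuations are the ones most likely to be verbatim or near-verbatim training data, which is what the ROUGE-L check below quantifies.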

For each continuation, the ROUGE-L similarity $r_\mathrm{LCS}(y, y_\mathrm{ref})$ against the true data is computed. Alignment between AttenMIA's score and actual memorization is measured by the Pearson correlation $\mathrm{corr}(s, r_\mathrm{LCS})$:

| Method | $\mathrm{corr}(s, r_\mathrm{LCS})$ |
| --- | --- |
| Best baseline (Zlib/XL) | ≈0.32 |
| AttenMIA perturbation score | ≈0.48 |

AttenMIA's attention-derived membership scores provide a roughly 50% relative improvement in memorization alignment over previous likelihood- and compression-based heuristics.

7. Implications and Significance

AttenMIA establishes that internal attention mechanisms, introduced for interpretability and efficient computation, are significant vectors of privacy leakage in LLMs. The framework achieves high-precision membership inference without auxiliary data, using only attention-derived statistics. Layer- and head-level analyses offer new insight into how memorization localizes across network depth and head specialization. When used for automated data extraction, AttenMIA's methods yield substantial improvements over the prior state of the art. A plausible implication is the need for new defense strategies specifically targeting attention-driven leakage, as canonical interpretability features may inadvertently exacerbate information exposure risks (Zaree et al., 26 Jan 2026).
