AttenMIA: Attention-based Membership Inference
- The paper introduces AttenMIA, a framework that utilizes perturbation-induced attention divergences and layer-wise transitional dynamics to identify training data presence.
- AttenMIA extracts key features using perturbation-based attention shifts and statistical measures such as KL divergence, Pearson correlation, and Frobenius norm across transformer layers.
- The framework achieves near-perfect discrimination across LLMs as demonstrated by high ROC AUC and low false positive rates, indicating significant privacy leakage risks.
AttenMIA is a membership inference attack framework targeting LLMs by leveraging internal self-attention mechanisms within the transformer architecture. By exploiting both perturbation-induced attention divergences and layer-wise transitional dynamics, AttenMIA identifies whether a specific input sequence was present in an LLM’s pretraining set. This framework operates in a white-box regime—requiring full access to model parameters and attention matrices—but does not require shadow models or explicit reference to training data. AttenMIA achieves state-of-the-art accuracy and low false positive rates, revealing that attention patterns encode fine-grained signals of memorization in LLMs (Zaree et al., 26 Jan 2026).
1. Problem Definition and Motivations
AttenMIA addresses the membership inference problem on a transformer-based LLM with the following setup:
Given a sequence $x$ and full (white-box) access to a target model $M$'s internal states—including attention weights—can an adversary infer the binary membership label

$$m(x) = \mathbb{1}\left[x \in D_{\text{train}}\right],$$

where $D_{\text{train}}$ denotes $M$'s pretraining set?
The threat model assumes adversaries can extract every self-attention matrix but have no shadow models or auxiliary training data references. Prior MIAs primarily rely on output confidence or embedding-based signals, which have limited robustness. AttenMIA instead exploits the distinctive properties of attention—in particular, that training set members typically induce sharper, more layer-consistent, and stable attention maps, whereas non-members produce uniform or noisy patterns. The framework systematically quantifies these distinctions to enable high-confidence membership inference.
2. Attention Feature Extraction: Perturbation and Transitional Statistics
AttenMIA formalizes two main classes of attention-derived features:
Perturbation-based Features
A family of perturbation functions $\{\pi_k\}_{k=1}^{K}$ is defined (e.g., token-dropping, token-replacement, non-member prefix insertion). For each $\pi_k$:
- The input is perturbed as $\tilde{x} = \pi_k(x)$.
- Attention matrices $A^{(l,h)}$ and $\tilde{A}^{(l,h)}$ for each layer $l$ and head $h$ are computed before and after perturbation.
Divergence is measured by the mean per-row Kullback–Leibler (KL) divergence:

$$\Delta_{\mathrm{KL}}^{(l,h)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left(A^{(l,h)}_{i,:} \,\middle\|\, \tilde{A}^{(l,h)}_{i,:}\right),$$

where $A^{(l,h)}$ and $\tilde{A}^{(l,h)}$ are attention maps on $x$ and $\tilde{x}$, respectively.
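A minimal sketch of this perturbation feature, assuming row-stochastic (softmax-normalized) attention matrices; the function names and the toy matrices are illustrative, not from the paper:

```python
import numpy as np

def row_kl(p, q, eps=1e-12):
    """KL divergence between two attention rows (probability vectors)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def perturbation_kl(A, A_tilde):
    """Mean per-row KL between clean and perturbed attention maps of one
    (layer, head). A, A_tilde: (n, n) row-stochastic matrices."""
    return float(np.mean([row_kl(A[i], A_tilde[i]) for i in range(len(A))]))

# Toy example: a sharp (member-like) map vs. a near-uniform perturbed map.
n = 4
A = np.eye(n) * 0.91 + 0.03
A /= A.sum(axis=1, keepdims=True)     # peaked, row-stochastic
A_tilde = np.full((n, n), 1.0 / n)    # uniform after perturbation
score = perturbation_kl(A, A_tilde)   # large divergence -> member-like signal
```

Identical clean and perturbed maps give a score of zero, so larger scores indicate greater attention sensitivity to the perturbation.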
Transitional Features
Intrinsic layer-to-layer attention dynamics are encoded using:
- Pearson correlation between consecutive layers: $\rho^{(l)} = \mathrm{corr}\!\left(\mathrm{vec}(A^{(l)}), \mathrm{vec}(A^{(l+1)})\right)$
- Normalized Frobenius distance: $d_F^{(l)} = \left\|A^{(l+1)} - A^{(l)}\right\|_F / \left\|A^{(l)}\right\|_F$
- Row-wise KL: $\frac{1}{n}\sum_{i=1}^{n} \mathrm{KL}\!\left(A^{(l)}_{i,:} \,\middle\|\, A^{(l+1)}_{i,:}\right)$
- Barycenter drift and variance: using the per-token attention barycenter $b^{(l)}_i = \sum_j j \, A^{(l)}_{ij}$, the drift $\left|b^{(l+1)}_i - b^{(l)}_i\right|$ is summarized by its mean and variance across the sequence
These transitional statistics capture both the stability and the evolution of attention structure across the network’s depth.
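The transitional statistics above can be sketched for a single head as follows; the formulas are hedged reconstructions of the described measures, and the matrices are toy inputs:

```python
import numpy as np

def transitional_stats(A_l, A_next, eps=1e-12):
    """Layer-to-layer transition features for one attention head.
    A_l, A_next: (n, n) row-stochastic maps of layers l and l+1."""
    n = A_l.shape[0]
    # Pearson correlation of the flattened maps
    rho = float(np.corrcoef(A_l.ravel(), A_next.ravel())[0, 1])
    # Normalized Frobenius distance
    frob = float(np.linalg.norm(A_next - A_l) / np.linalg.norm(A_l))
    # Mean row-wise KL divergence
    P, Q = np.clip(A_l, eps, None), np.clip(A_next, eps, None)
    kl = float(np.mean(np.sum(P * np.log(P / Q), axis=1)))
    # Barycenter drift: expected attended position per query token
    pos = np.arange(n)
    drift = np.abs(A_next @ pos - A_l @ pos)
    return {"pearson": rho, "frobenius": frob, "row_kl": kl,
            "drift_mean": float(drift.mean()), "drift_var": float(drift.var())}

# Usage: identical consecutive layers yield maximal correlation, zero drift.
A = np.eye(4) * 0.9 + 0.025
A /= A.sum(axis=1, keepdims=True)
stats = transitional_stats(A, A)
```

Member sequences would be expected to show higher inter-layer correlation and lower drift variance than non-members, per the paper's characterization of "stable" attention.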
3. Feature Aggregation and Classifier Design
All per-head and per-layer attention features are concatenated into a feature vector $v(x) \in \mathbb{R}^{d}$, with $d$ proportional to $L \cdot H$, where $L$ is the number of transformer layers and $H$ the number of attention heads per layer.
A lightweight multi-layer perceptron (MLP) $f_\theta$ is trained to predict membership, $\hat{m}(x) = f_\theta(v(x))$,
using binary cross-entropy loss over labeled (member/non-member) instances:

$$\mathcal{L}(\theta) = -\sum_{x} \left[\, m(x) \log \hat{m}(x) + \left(1 - m(x)\right) \log\left(1 - \hat{m}(x)\right) \right]$$
This design enables flexible and scalable scoring across diverse architectures and input lengths.
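A hedged sketch of this classifier stage: scikit-learn's `MLPClassifier` (which minimizes cross-entropy internally) stands in for the paper's unspecified MLP, and the feature vectors are synthetic stand-ins for real attention features:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
d, n = 32, 400   # toy feature dimension (L * H * n_stats in practice) and samples

# Synthetic stand-in: members have shifted attention statistics on average
X_member = rng.normal(0.6, 0.2, size=(n, d))
X_nonmem = rng.normal(0.4, 0.2, size=(n, d))
X = np.vstack([X_member, X_nonmem])
y = np.array([1] * n + [0] * n)

# Lightweight MLP membership classifier
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

In practice one would hold out an evaluation split; training accuracy is shown here only to keep the sketch short.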
4. Experimental Benchmarks and Results
AttenMIA was evaluated on open-weight LLMs (LLaMA-2, Pythia, OPT, GPT-NeoX) using several benchmarks:
- WikiMIA-32/64/128: Wikipedia-derived sequences of 32, 64, or 128 tokens
- MIMIR subsets: domain-diverse data (GitHub, Pile CC, PubMed, Wikipedia, arXiv, DM Math, HackerNews)
The main performance metrics were:
- ROC AUC (area under the receiver operating characteristic curve), computed as $\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$
- TPR@1%FPR (true positive rate when the false positive rate is fixed at 1%)
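Both metrics can be computed with scikit-learn; the scores below are synthetic, not values from the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
# Synthetic attack scores: members score higher on average
scores = np.concatenate([rng.normal(2.0, 1.0, 500),   # members
                         rng.normal(0.0, 1.0, 500)])  # non-members
labels = np.concatenate([np.ones(500), np.zeros(500)])

auc = roc_auc_score(labels, scores)

fpr, tpr, _ = roc_curve(labels, scores)
# TPR at the strictest threshold whose FPR does not exceed 1%
tpr_at_1fpr = tpr[fpr <= 0.01].max()
```

TPR@1%FPR is the more demanding metric: it measures how many members an attacker recovers while almost never falsely accusing non-members, which is why low-FPR performance is emphasized in the results.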
Key results include:
| Model / Dataset | ROC AUC | TPR@1%FPR |
|---|---|---|
| LLaMA-2-13B, WikiMIA-32 | 0.996 | 87.9% |
| Pythia-6.9B, GitHub subset | ≈1.00 | ≈95.4% |
| MIMIR/Pythia (avg) | 0.89–0.99 | 42.3–55.4% |
This demonstrates near-perfect discrimination between members and non-members, particularly under low false positive constraints.
5. Layer and Head-level Memorization Analysis
AttenMIA enables granular analysis of where memorization occurs within transformer architectures:
- KL-to-uniform: Deeper layers and heads exhibit systematically higher $\mathrm{KL}\!\left(A^{(l,h)}_{i,:} \,\middle\|\, \mathcal{U}\right)$ for training members, reflecting sharper and more peaked attention maps.
- Perturbation sensitivity: $\Delta_{\mathrm{KL}}$ distributions cleanly separate members from non-members (kernel density analysis).
- Transitional feature distinctions (correlation, Frobenius norm, row-wise KL, barycenter drift) all offer statistically significant separation using Hellinger and KL divergence measures.
- Feature aggregation across layers: Using attention features from lower, middle, and upper layers in aggregate steadily improves ROC AUC; maximal performance is achieved by leveraging all layers.
This suggests that membership signals are widely distributed yet amplified at greater network depth, and that attention head specialization contributes to memorization phenomena.
6. Integration with Data Extraction Pipelines
AttenMIA’s membership scoring substantially enhances data extraction attacks. In the “generate-and-rank” pipeline:
- Random 5–10 token prefixes are used to prompt a generator (e.g., GPT-2).
- Model continuations of 256 tokens are produced.
- Each candidate continuation $y$'s attention feature vector $v(y)$ is extracted and scored via the trained MLP $f_\theta$.
- Sequences are ranked by the resulting membership score $s(y) = f_\theta(v(y))$.
For each continuation, ROUGE-L similarity is computed against the true training data. The Pearson correlation between AttenMIA's score and actual memorization is measured as $r$:
| Method | Pearson $r$ |
|---|---|
| Best baseline (Zlib/XL) | ≈0.32 |
| AttenMIA perturbation score | ≈0.48 |
AttenMIA's attention-derived membership scores provide a roughly 50% relative improvement in memorization alignment (≈0.32 → ≈0.48) over previous likelihood- and compression-based heuristics.
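The alignment metric $r$ is a standard Pearson correlation between per-candidate attack scores and ROUGE-L overlap; a minimal sketch with hypothetical (synthetic) values:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
# Hypothetical per-candidate values: attack score and ROUGE-L overlap with
# the true training text (synthetic noisy linear relation, for illustration)
mia_score = rng.uniform(0, 1, 200)
rouge_l = 0.5 * mia_score + rng.normal(0, 0.15, 200)

r, p_value = pearsonr(mia_score, rouge_l)
```

A higher $r$ means the attack's ranking of candidate generations tracks which generations actually reproduce training data, which is what makes the score useful as a reranker in the extraction pipeline.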
7. Implications and Significance
AttenMIA establishes that internal attention mechanisms, introduced for interpretability and efficient computation, are significant vectors of privacy leakage in LLMs. The framework achieves high-precision membership inference without auxiliary data, using only attention-derived statistics. Layer- and head-level analyses facilitate novel understandings of memorization localization within network depth and head specialization. When used for automated data extraction, AttenMIA’s methods yield substantial improvements over prior state-of-the-art. A plausible implication is the need for new defense strategies specifically targeting attention-driven leakage, as canonical interpretability features may inadvertently exacerbate information exposure risks (Zaree et al., 26 Jan 2026).