AttenMIA: Attention-based Membership Inference
- The paper introduces AttenMIA, a framework that utilizes perturbation-induced attention divergences and layer-wise transitional dynamics to identify training data presence.
- AttenMIA extracts key features using perturbation-based attention shifts and statistical measures such as KL divergence, Pearson correlation, and Frobenius norm across transformer layers.
- The framework achieves near-perfect discrimination across LLMs as demonstrated by high ROC AUC and low false positive rates, indicating significant privacy leakage risks.
AttenMIA is a membership inference attack framework targeting LLMs by leveraging internal self-attention mechanisms within the transformer architecture. By exploiting both perturbation-induced attention divergences and layer-wise transitional dynamics, AttenMIA identifies whether a specific input sequence was present in an LLM’s pretraining set. This framework operates in a white-box regime—requiring full access to model parameters and attention matrices—but does not require shadow models or explicit reference to training data. AttenMIA achieves state-of-the-art accuracy and low false positive rates, revealing that attention patterns encode fine-grained signals of memorization in LLMs (Zaree et al., 26 Jan 2026).
1. Problem Definition and Motivations
AttenMIA addresses the membership inference problem on a transformer-based LLM with the following setup:
Given a sequence $x$ and full (white-box) access to a target model $M$'s internal states—including attention weights—can an adversary infer the binary membership label

$$m(x) = \mathbb{1}\left[x \in D_{\text{train}}\right],$$

where $D_{\text{train}}$ denotes $M$'s pretraining set?
The threat model assumes adversaries can extract every self-attention matrix but have no shadow models or auxiliary training data references. Prior MIAs primarily rely on output confidence or embedding-based signals, which have limited robustness. AttenMIA instead exploits the distinctive properties of attention—in particular, that training set members typically induce sharper, more layer-consistent, and stable attention maps, whereas non-members produce uniform or noisy patterns. The framework systematically quantifies these distinctions to enable high-confidence membership inference.
2. Attention Feature Extraction: Perturbation and Transitional Statistics
AttenMIA formalizes two main classes of attention-derived features:
Perturbation-based Features
A family of perturbation functions $\{\pi_k\}_{k=1}^{K}$ is defined (e.g., token-dropping, token-replacement, non-member prefix insertion). For each $\pi_k$:
- The input is perturbed as $\tilde{x} = \pi_k(x)$.
- Attention matrices $A^{(l,h)}$ and $\tilde{A}^{(l,h)}$ for each layer $l$ and head $h$ are computed before and after perturbation.
Divergence is measured by the mean per-row Kullback–Leibler (KL) divergence:

$$\Delta_{\mathrm{KL}}^{(l,h)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left(A^{(l,h)}_{i,:} \,\middle\|\, \tilde{A}^{(l,h)}_{i,:}\right),$$

where $A^{(l,h)}$ and $\tilde{A}^{(l,h)}$ are attention maps on $x$ and $\tilde{x}$, respectively.
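A minimal sketch of this perturbation feature, assuming row-stochastic (softmax-normalized) attention matrices; the function names and the toy matrices are illustrative, not from the paper:

```python
import numpy as np

def row_kl(p, q, eps=1e-12):
    """KL divergence between two attention rows (probability vectors)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def perturbation_kl(A, A_tilde):
    """Mean per-row KL between clean and perturbed attention maps of one
    (layer, head). A, A_tilde: (n, n) row-stochastic matrices."""
    return float(np.mean([row_kl(A[i], A_tilde[i]) for i in range(len(A))]))

# Toy example: a sharp (member-like) map vs. a near-uniform perturbed map.
n = 4
A = np.eye(n) * 0.91 + 0.03
A /= A.sum(axis=1, keepdims=True)     # peaked, row-stochastic
A_tilde = np.full((n, n), 1.0 / n)    # uniform after perturbation
score = perturbation_kl(A, A_tilde)   # large divergence -> member-like signal
```

Identical clean and perturbed maps give a score of zero, so larger scores indicate greater attention sensitivity to the perturbation.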
Transitional Features
Intrinsic layer-to-layer attention dynamics are encoded using:
- Pearson correlation between consecutive layers: $\rho^{(l)} = \mathrm{corr}\!\left(\mathrm{vec}(A^{(l)}), \mathrm{vec}(A^{(l+1)})\right)$
- Normalized Frobenius distance: $d_F^{(l)} = \left\|A^{(l+1)} - A^{(l)}\right\|_F / \left\|A^{(l)}\right\|_F$
- Row-wise KL: $\frac{1}{n}\sum_{i=1}^{n} \mathrm{KL}\!\left(A^{(l)}_{i,:} \,\middle\|\, A^{(l+1)}_{i,:}\right)$
- Barycenter drift and variance: using the per-token attention barycenter $b^{(l)}_i = \sum_j j \, A^{(l)}_{ij}$, the drift $\left|b^{(l+1)}_i - b^{(l)}_i\right|$ is summarized by its mean and variance across the sequence
These transitional statistics capture both the stability and the evolution of attention structure across the network’s depth.
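The transitional statistics above can be sketched for a single head as follows; the formulas are hedged reconstructions of the described measures, and the matrices are toy inputs:

```python
import numpy as np

def transitional_stats(A_l, A_next, eps=1e-12):
    """Layer-to-layer transition features for one attention head.
    A_l, A_next: (n, n) row-stochastic maps of layers l and l+1."""
    n = A_l.shape[0]
    # Pearson correlation of the flattened maps
    rho = float(np.corrcoef(A_l.ravel(), A_next.ravel())[0, 1])
    # Normalized Frobenius distance
    frob = float(np.linalg.norm(A_next - A_l) / np.linalg.norm(A_l))
    # Mean row-wise KL divergence
    P, Q = np.clip(A_l, eps, None), np.clip(A_next, eps, None)
    kl = float(np.mean(np.sum(P * np.log(P / Q), axis=1)))
    # Barycenter drift: expected attended position per query token
    pos = np.arange(n)
    drift = np.abs(A_next @ pos - A_l @ pos)
    return {"pearson": rho, "frobenius": frob, "row_kl": kl,
            "drift_mean": float(drift.mean()), "drift_var": float(drift.var())}

# Usage: identical consecutive layers yield maximal correlation, zero drift.
A = np.eye(4) * 0.9 + 0.025
A /= A.sum(axis=1, keepdims=True)
stats = transitional_stats(A, A)
```

Member sequences would be expected to show higher inter-layer correlation and lower drift variance than non-members, per the paper's characterization of "stable" attention.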
3. Feature Aggregation and Classifier Design
All per-head and per-layer attention features are concatenated into a feature vector $v(x) \in \mathbb{R}^{d}$, with $d$ proportional to $L \cdot H$, where $L$ is the number of transformer layers and $H$ the number of attention heads per layer.
A lightweight multi-layer perceptron (MLP) $f_\theta$ is trained to predict membership, $\hat{m}(x) = f_\theta(v(x))$,
using binary cross-entropy loss over labeled (member/non-member) instances:

$$\mathcal{L}(\theta) = -\sum_{x} \left[\, m(x) \log \hat{m}(x) + \left(1 - m(x)\right) \log\left(1 - \hat{m}(x)\right) \right]$$
This design enables flexible and scalable scoring across diverse architectures and input lengths.
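A hedged sketch of this classifier stage: scikit-learn's `MLPClassifier` (which minimizes cross-entropy internally) stands in for the paper's unspecified MLP, and the feature vectors are synthetic stand-ins for real attention features:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
d, n = 32, 400   # toy feature dimension (L * H * n_stats in practice) and samples

# Synthetic stand-in: members have shifted attention statistics on average
X_member = rng.normal(0.6, 0.2, size=(n, d))
X_nonmem = rng.normal(0.4, 0.2, size=(n, d))
X = np.vstack([X_member, X_nonmem])
y = np.array([1] * n + [0] * n)

# Lightweight MLP membership classifier
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

In practice one would hold out an evaluation split; training accuracy is shown here only to keep the sketch short.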
4. Experimental Benchmarks and Results
AttenMIA was evaluated on open-weight LLMs (LLaMA-2, Pythia, OPT, GPT-NeoX) using several benchmarks:
- WikiMIA-32/64/128: Wikipedia-derived sequences of 32, 64, or 128 tokens
- MIMIR subsets: domain-diverse data (GitHub, Pile CC, PubMed, Wikipedia, arXiv, DM Math, HackerNews)
The main performance metrics were:
- ROC AUC (area under the receiver operating characteristic curve), computed as $\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$
- TPR@1%FPR (true positive rate when the false positive rate is fixed at 1%)
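Both metrics can be computed with scikit-learn; the scores below are synthetic, not values from the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
# Synthetic attack scores: members score higher on average
scores = np.concatenate([rng.normal(2.0, 1.0, 500),   # members
                         rng.normal(0.0, 1.0, 500)])  # non-members
labels = np.concatenate([np.ones(500), np.zeros(500)])

auc = roc_auc_score(labels, scores)

fpr, tpr, _ = roc_curve(labels, scores)
# TPR at the strictest threshold whose FPR does not exceed 1%
tpr_at_1fpr = tpr[fpr <= 0.01].max()
```

TPR@1%FPR is the more demanding metric: it measures how many members an attacker recovers while almost never falsely accusing non-members, which is why low-FPR performance is emphasized in the results.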
Key results include:
| Model / Dataset | ROC AUC | TPR@1%FPR |
|---|---|---|
| LLaMA-2-13B, WikiMIA-32 | 0.996 | 87.9% |
| Pythia-6.9B, GitHub subset | ≈1.00 | ≈95.4% |
| MIMIR/Pythia (avg) | 0.89–0.99 | 42.3–55.4% |
This demonstrates near-perfect discrimination between members and non-members, particularly under low false positive constraints.
5. Layer and Head-level Memorization Analysis
AttenMIA enables granular analysis of where memorization occurs within transformer architectures:
- KL-to-uniform: Deeper layers and heads exhibit systematically higher $\mathrm{KL}\!\left(A^{(l,h)}_{i,:} \,\middle\|\, \mathcal{U}\right)$ for training members, reflecting sharper and more peaked attention maps.
- Perturbation sensitivity: $\Delta_{\mathrm{KL}}$ distributions cleanly separate members from non-members (kernel density analysis).
- Transitional feature distinctions (correlation, Frobenius norm, row-wise KL, barycenter drift) all offer statistically significant separation using Hellinger and KL divergence measures.
- Feature aggregation across layers: Using attention features from lower, middle, and upper layers in aggregate steadily improves ROC AUC; maximal performance is achieved by leveraging all layers.
This suggests that membership signals are widely distributed yet amplified at greater network depth, and that attention head specialization contributes to memorization phenomena.
6. Integration with Data Extraction Pipelines
AttenMIA’s membership scoring substantially enhances data extraction attacks. In the “generate-and-rank” pipeline:
- Random 5–10 token prefixes are used to prompt a generator (e.g., GPT-2).
- Model continuations of 256 tokens are produced.
- Each candidate continuation $y$'s attention feature vector $v(y)$ is extracted and scored via the trained MLP $f_\theta$.
- Sequences are ranked by the resulting membership score $s(y) = f_\theta(v(y))$.
For each continuation, ROUGE-L similarity is computed against the true training data. The Pearson correlation between AttenMIA's score and actual memorization is measured as $r$:
| Method | Pearson $r$ |
|---|---|
| Best baseline (Zlib/XL) | ≈0.32 |
| AttenMIA perturbation score | ≈0.48 |
AttenMIA's attention-derived membership scores provide a roughly 50% relative improvement in memorization alignment (≈0.32 → ≈0.48) over previous likelihood- and compression-based heuristics.
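The alignment metric $r$ is a standard Pearson correlation between per-candidate attack scores and ROUGE-L overlap; a minimal sketch with hypothetical (synthetic) values:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
# Hypothetical per-candidate values: attack score and ROUGE-L overlap with
# the true training text (synthetic noisy linear relation, for illustration)
mia_score = rng.uniform(0, 1, 200)
rouge_l = 0.5 * mia_score + rng.normal(0, 0.15, 200)

r, p_value = pearsonr(mia_score, rouge_l)
```

A higher $r$ means the attack's ranking of candidate generations tracks which generations actually reproduce training data, which is what makes the score useful as a reranker in the extraction pipeline.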
7. Implications and Significance
AttenMIA establishes that internal attention mechanisms, introduced for interpretability and efficient computation, are significant vectors of privacy leakage in LLMs. The framework achieves high-precision membership inference without auxiliary data, using only attention-derived statistics. Layer- and head-level analyses facilitate novel understandings of memorization localization within network depth and head specialization. When used for automated data extraction, AttenMIA’s methods yield substantial improvements over prior state-of-the-art. A plausible implication is the need for new defense strategies specifically targeting attention-driven leakage, as canonical interpretability features may inadvertently exacerbate information exposure risks (Zaree et al., 26 Jan 2026).