Blind Baselines Beat Membership Inference Attacks for Foundation Models (2406.16201v1)

Published 23 Jun 2024 in cs.CR, cs.CL, and cs.LG

Abstract: Membership inference (MI) attacks try to determine if a data sample was used to train a machine learning model. For foundation models trained on unknown Web data, MI attacks can be used to detect copyrighted training materials, measure test set contamination, or audit machine unlearning. Unfortunately, we find that evaluations of MI attacks for foundation models are flawed, because they sample members and non-members from different distributions. For 8 published MI evaluation datasets, we show that blind attacks -- that distinguish the member and non-member distributions without looking at any trained model -- outperform state-of-the-art MI attacks. Existing evaluations thus tell us nothing about membership leakage of a foundation model's training data.

Authors (3)
  1. Debeshee Das (4 papers)
  2. Jie Zhang (847 papers)
  3. Florian Tramèr (87 papers)
Citations (19)

Summary

  • The paper demonstrates that simple blind attacks, which ignore model internals, consistently outperform state-of-the-art MI techniques.
  • It identifies critical flaws like temporal shifts and data collection biases that allow straightforward methods such as bag-of-words and date detection to succeed.
  • The findings call for using datasets with clear train-test splits to improve the reliability of MI evaluations and bolster machine learning privacy practices.

A Critical Examination of Membership Inference Attack Evaluations on Foundation Models

The paper "Blind Baselines Beat Membership Inference Attacks for Foundation Models" presents a rigorous analysis of the current state of membership inference (MI) attacks on foundation models such as GPT-4, Gemini, and DALL-E, which are often trained on undisclosed web datasets. This work argues that existing evaluations of MI attacks are fundamentally flawed due to the use of distinguishable member and non-member distributions, thus compromising the reliability of such evaluations.

Key Contributions

The authors investigate eight published MI evaluation datasets and demonstrate that "blind" attacks, which ignore the trained model entirely, outperform state-of-the-art MI attacks. This finding challenges the validity of the current MI attack evaluation methodology and has significant implications for privacy, copyright detection, test data contamination, and auditing machine unlearning efforts.
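To make the distinction concrete, here is a minimal, purely illustrative sketch of the two attack interfaces (the function names and the year-based cue are hypothetical, not the paper's implementation): a model-based MI attack needs a query to the target model, whereas a blind attack scores the raw text alone.

```python
def loss_based_mi_score(target_model_loss: float) -> float:
    """Model-based attack: requires querying the target model; a lower
    loss on the sample is taken as evidence of membership."""
    return -target_model_loss


def blind_mi_score(sample: str) -> float:
    """Blind attack: uses only the text itself, never the target model.
    Toy cue: in flawed evaluations, recent years suggest non-members."""
    return 0.0 if "2023" in sample else 1.0
```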

Analysis of MI Evaluation Flaws

The paper identifies critical issues in the design of MI evaluation datasets:

  1. Temporal Shifts: Many datasets exhibit temporal shifts between members and non-members, allowing attackers to distinguish data samples merely based on the dates mentioned in the text (a short illustrative sketch follows this list).
  2. Biases in Data Replication: Even with efforts to match member and non-member distributions, slight variations in data collection procedures can introduce biases that blind attacks can exploit.
  3. Distinguishable Tails: Despite attempts to align distributions, outliers in the datasets can still be easily distinguishable.
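To illustrate the temporal-shift flaw, the following sketch (with tiny, made-up example texts) simply tallies the years mentioned in each split. When members were collected before a training cutoff and non-members after it, the two histograms barely overlap, so membership can be guessed without ever querying a model.

```python
import re
from collections import Counter

# Illustrative toy corpora: members scraped before a training cutoff,
# non-members collected afterwards (made-up snippets, not paper data).
member_texts = [
    "The 2020 season was suspended in March 2020.",
    "Her 2021 album topped the charts for six weeks.",
]
non_member_texts = [
    "The 2024 election results were announced on Tuesday.",
    "A 2023 survey of practitioners found rising adoption.",
]

def year_histogram(texts):
    """Count four-digit years (1900-2029) mentioned across the texts."""
    years = []
    for text in texts:
        years += re.findall(r"\b(19\d{2}|20[0-2]\d)\b", text)
    return Counter(years)

print("members:    ", year_histogram(member_texts))
print("non-members:", year_histogram(non_member_texts))
```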

Methods for Blind Attacks

The authors propose several simple yet effective blind attack techniques to exploit these flaws:

  • Date Detection: Extracting and analyzing specific dates from text samples to discern membership based on temporal information.
  • Bag-of-Words Classification: Training simple classifiers on word-frequency features to distinguish members from non-members (a minimal sketch follows this list).
  • Greedy Rare Word Selection: Identifying n-grams with high distinguishing power and using them for membership inference.
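As a concrete example of the second technique, here is a minimal sketch of a bag-of-words blind baseline, assuming `member_texts` and `non_member_texts` hold realistically sized labeled corpora from an MI evaluation dataset; the exact classifier and features used in the paper may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# member_texts / non_member_texts: labeled corpora from the evaluation
# dataset (placeholders; assumed to contain many samples per split).
texts = member_texts + non_member_texts
labels = [1] * len(member_texts) + [0] * len(non_member_texts)

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

# The classifier never sees the target model: it only learns which words
# are more frequent among members than among non-members.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_x, train_y)
scores = clf.predict_proba(test_x)[:, 1]  # per-sample membership scores
```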

Empirical Results

The paper provides empirical evidence of the efficacy of these techniques across various datasets (Tables 1 and 2 of the paper); a sketch of the TPR-at-fixed-FPR metric reported below follows the list:

  • WikiMIA: A bag-of-words classifier achieved a TPR of 94.4% at 5% FPR, vastly outperforming prior MI attacks.
  • BookMIA: The blind bag-of-words classifier obtained a 90.5% AUC ROC.
  • Temporal Wiki and Temporal arXiv: Blind attacks showed marginally better performance than state-of-the-art methods.
  • ArXiv-1 Month: A simplistic date detection approach achieved 13.4% TPR at 1% FPR.
  • Multi-Webdata: Bag-of-words classifiers achieved a TPR of 83.5% at 1% FPR.
  • LAION-MI: Greedy rare word selection surpassed the best MI attacks with a TPR of 9.9% at 1% FPR.
  • Project Gutenberg: The exploitation of metadata discrepancies achieved a TPR of 59.6% at 1% FPR.
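The headline numbers above are true-positive rates at a fixed low false-positive rate, plus AUC-ROC. Continuing from the bag-of-words sketch, here is one minimal way to compute these metrics from attack scores (an illustrative helper, not the paper's evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def tpr_at_fpr(y_true, y_score, target_fpr=0.01):
    """Largest TPR achievable without exceeding the target FPR."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(tpr[np.searchsorted(fpr, target_fpr, side="right") - 1])

print("AUC-ROC:   ", roc_auc_score(test_y, scores))
print("TPR@1%FPR: ", tpr_at_fpr(test_y, scores, target_fpr=0.01))
print("TPR@5%FPR: ", tpr_at_fpr(test_y, scores, target_fpr=0.05))
```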

Implications and Future Directions

The findings presented have profound implications for the field of machine learning privacy. The reliance on flawed evaluation datasets undermines the trust in current MI techniques, exposing the need for more robust evaluation methodologies. The authors convincingly argue that future MI attacks should be evaluated on datasets with clear train-test splits to preclude distribution shifts.

Notably, datasets such as the Pile, DataComp, and DataComp-LM provide such splits, and the authors suggest that these should become the standard for MI evaluations. Using these rigorous benchmarks can ensure that MI attacks genuinely measure the ability to extract membership information from models.
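A minimal sketch of the recommended setup follows (with `load_corpus` as a hypothetical stand-in for loading one of these corpora): members and non-members come from a single random split of the same corpus, the target model is trained only on the member half, and any attack is then judged on how well it separates the two halves.

```python
import random

# Placeholder for loading a single homogeneous corpus, e.g. the training
# pool of the Pile or DataComp (load_corpus is hypothetical).
documents = load_corpus()

random.seed(0)
random.shuffle(documents)

split = len(documents) // 2
members, non_members = documents[:split], documents[split:]

# The target model must be trained on `members` only; an MI attack is
# then scored on distinguishing `members` from `non_members`, so any
# success reflects model leakage rather than a distribution shift.
```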

Conclusion

The paper delivers a critical evaluation of existing MI attacks and their evaluations, revealing fundamental flaws that lead simple blind attacks to outperform sophisticated methods. By highlighting these issues and proposing more rigorous evaluation frameworks, this work provides a necessary course correction for the field. Future research should leverage datasets with random train-test splits like DataComp and the Pile to ensure the reliability and effectiveness of MI techniques, bolstering the credibility of privacy and security applications in machine learning.