Why Train More? Effective and Efficient Membership Inference via Memorization (2310.08015v1)

Published 12 Oct 2023 in cs.LG and cs.CR

Abstract: Membership Inference Attacks (MIAs) aim to identify specific data samples within the private training dataset of machine learning models, leading to serious privacy violations and other sophisticated threats. Many practical black-box MIAs require query access to the data distribution (the same distribution from which the private data is drawn) to train shadow models. By doing so, the adversary obtains models trained "with" or "without" samples drawn from the distribution, and analyzes the characteristics of the samples under consideration. The adversary is often required to train hundreds of shadow models or more to extract the signals needed for MIAs; this constitutes the main computational overhead of MIAs. In this paper, we propose that by strategically choosing the samples, MI adversaries can maximize their attack success while minimizing the number of shadow models. First, our motivational experiments suggest memorization as the key property explaining disparate sample vulnerability to MIAs. We formalize this through a theoretical bound that connects MI advantage with memorization. Second, we show sample complexity bounds that connect the number of shadow models needed for MIAs with memorization. Lastly, we confirm our theoretical arguments with comprehensive experiments; by utilizing samples with high memorization scores, the adversary can (a) significantly improve its efficacy regardless of the MIA used, and (b) reduce the number of shadow models by nearly two orders of magnitude compared to state-of-the-art approaches.
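As a rough illustration of the pipeline the abstract describes, the sketch below trains a small number of shadow models with and without candidate samples, estimates a Feldman-style memorization score from the resulting confidence gap, and then runs a simple likelihood-ratio membership test only on the most-memorized candidates. This is a minimal sketch under strong simplifying assumptions (synthetic data, logistic-regression shadow models, illustrative hyperparameters such as `n_shadow` and `candidates`), not the authors' implementation.

```python
# Minimal, self-contained sketch (NOT the paper's implementation) of
# memorization-guided membership inference with shadow models.
# The data, shadow-model family, and hyperparameters are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "population" the adversary can sample from.
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

def train_shadow(idx):
    """Train one shadow model on the population subset indexed by idx."""
    return LogisticRegression(max_iter=200).fit(X[idx], y[idx])

def confidence(model, i):
    """Model confidence assigned to the true label of sample i."""
    return model.predict_proba(X[i:i + 1])[0, y[i]]

n_shadow = 16            # far fewer shadow models than exhaustive attacks use
candidates = range(50)   # samples whose membership the adversary tests

# Feldman-style memorization estimate: confidence gap between shadow models
# trained WITH a candidate sample and shadow models trained WITHOUT it.
in_conf = {i: [] for i in candidates}
out_conf = {i: [] for i in candidates}
for _ in range(n_shadow):
    subset = rng.choice(len(X), size=len(X) // 2, replace=False)
    model = train_shadow(subset)
    members = set(subset.tolist())
    for i in candidates:
        (in_conf if i in members else out_conf)[i].append(confidence(model, i))

mem_score = {
    i: np.mean(in_conf[i] or [0.0]) - np.mean(out_conf[i] or [0.0])
    for i in candidates
}

# Strategic sample choice: attack only the most-memorized candidates, which
# are the most vulnerable, so fewer shadow models are needed overall.
top_k = sorted(candidates, key=mem_score.get, reverse=True)[:10]

# A simple likelihood-ratio style membership test against a target model:
# fit Gaussians to each candidate's IN/OUT confidence distributions and
# score the target model's confidence on that candidate.
target = train_shadow(rng.choice(len(X), size=len(X) // 2, replace=False))
for i in top_k[:3]:
    c = confidence(target, i)
    score = (norm.logpdf(c, np.mean(in_conf[i]), np.std(in_conf[i]) + 1e-6)
             - norm.logpdf(c, np.mean(out_conf[i]), np.std(out_conf[i]) + 1e-6))
    print(f"candidate {i}: memorization {mem_score[i]:.3f}, LR score {score:.2f}")
```

The selection step is the crux of the paper's claim: because high-memorization samples are disproportionately vulnerable, an adversary that restricts its attention to them can reach comparable or better attack success with far fewer shadow models.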

Authors (4)
  1. Jihye Choi (13 papers)
  2. Shruti Tople (28 papers)
  3. Varun Chandrasekaran (39 papers)
  4. Somesh Jha (112 papers)
Citations (2)