Membership Inference Attacks and Privacy in Topic Modeling (2403.04451v2)
Abstract: Recent research shows that LLMs are susceptible to privacy attacks that infer aspects of the training data. However, it is unclear if simpler generative models, like topic models, share similar vulnerabilities. In this work, we propose an attack against topic models that can confidently identify members of the training data in Latent Dirichlet Allocation. Our results suggest that the privacy risks associated with generative modeling are not restricted to large neural models. Additionally, to mitigate these vulnerabilities, we explore differentially private (DP) topic modeling. We propose a framework for private topic modeling that incorporates DP vocabulary selection as a pre-processing step, and show that it improves privacy while having limited effects on practical utility.
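As a rough illustration of what DP vocabulary selection as a pre-processing step can look like, the sketch below caps how many distinct words each document may contribute, adds Laplace noise to the resulting per-word document counts, and keeps only the words whose noisy count clears a threshold. This is a minimal sketch under stated assumptions, not the procedure proposed in the paper: the function name `dp_vocabulary` and the parameters `threshold` and `max_words_per_doc` are hypothetical, and the threshold is taken as an input rather than calibrated to a target δ.

```python
from collections import Counter

import numpy as np


def dp_vocabulary(docs, epsilon, threshold, max_words_per_doc, seed=0):
    """Select a vocabulary from noisy per-word document counts.

    Assumption: `threshold` is supplied by the caller. A real deployment
    would derive it from the desired (epsilon, delta) guarantee.
    """
    rng = np.random.default_rng(seed)
    counts = Counter()
    for doc in docs:
        # Each document contributes each distinct word at most once.
        distinct = sorted(set(doc))
        # Cap the contribution so one document changes at most
        # `max_words_per_doc` counts, each by 1 (bounded L1 sensitivity).
        if len(distinct) > max_words_per_doc:
            idx = rng.choice(len(distinct), size=max_words_per_doc, replace=False)
            distinct = [distinct[i] for i in idx]
        counts.update(distinct)

    # Laplace noise scaled to the L1 sensitivity of the count vector.
    scale = max_words_per_doc / epsilon
    vocab = set()
    for word, count in counts.items():
        if count + rng.laplace(0.0, scale) > threshold:
            vocab.add(word)
    return vocab
```

The topic model would then be trained only on tokens in the returned set, so words that appear in very few documents are unlikely to reach the released vocabulary.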