
Rethinking Privacy in Machine Learning Pipelines from an Information Flow Control Perspective (2311.15792v1)

Published 27 Nov 2023 in cs.LG and cs.CR

Abstract: Modern machine learning systems use models trained on ever-growing corpora. Typically, metadata such as ownership, access control, or licensing information is ignored during training. Instead, to mitigate privacy risks, we rely on generic techniques such as dataset sanitization and differentially private model training, with inherent privacy/utility trade-offs that hurt model performance. Moreover, these techniques have limitations in scenarios where sensitive information is shared across multiple participants and fine-grained access control is required. By ignoring metadata, we therefore miss an opportunity to better address security, privacy, and confidentiality challenges. In this paper, we take an information flow control perspective to describe machine learning systems, which allows us to leverage metadata such as access control policies and define clear-cut privacy and confidentiality guarantees with interpretable information flows. Under this perspective, we contrast two different approaches to achieve user-level non-interference: 1) fine-tuning per-user models, and 2) retrieval augmented models that access user-specific datasets at inference time. We compare these two approaches to a trivially non-interfering zero-shot baseline using a public model and to a baseline that fine-tunes this model on the whole corpus. We evaluate trained models on two datasets of scientific articles and demonstrate that retrieval augmented architectures deliver the best utility, scalability, and flexibility while satisfying strict non-interference guarantees.
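The second approach the abstract contrasts can be illustrated with a minimal sketch (all names here are hypothetical, not from the paper): a single frozen public model is shared by everyone, while each user's sensitive documents live in a per-user store that is consulted only at inference time. User-level non-interference then follows from an ordinary access-control check on retrieval, rather than from properties of the trained weights.

```python
# Minimal sketch (hypothetical names) of retrieval-augmented inference with
# per-user datastores. The public model's weights never see private data, so
# adding or removing one user's store cannot change another user's outputs.

from dataclasses import dataclass, field


@dataclass
class UserStore:
    """Per-user datastore; only its owner may read from it."""
    owner: str
    docs: list = field(default_factory=list)

    def retrieve(self, query: str, requester: str) -> list:
        # The non-interference guarantee is enforced here, at retrieval
        # time, not inside the model weights.
        if requester != self.owner:
            raise PermissionError(f"{requester} may not read {self.owner}'s store")
        return [d for d in self.docs if query.lower() in d.lower()]


def public_model(prompt: str, context: list) -> str:
    # Stand-in for a frozen public language model conditioned on
    # retrieved context.
    return f"answer({prompt!r}, using {len(context)} private docs)"


def answer(query: str, user: str, stores: dict) -> str:
    context = stores[user].retrieve(query, requester=user)
    return public_model(query, context)


stores = {
    "alice": UserStore("alice", ["alice's draft on flow control"]),
    "bob": UserStore("bob", []),
}
print(answer("flow control", "alice", stores))
print(answer("flow control", "bob", stores))
```

Bob's answer here is identical to what he would get if Alice's store did not exist at all, which is the interpretable flow guarantee the paper argues for; the per-user fine-tuning alternative achieves the same property but requires training and storing one model per user.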
