
Can LLMs get help from other LLMs without revealing private information? (2404.01041v2)

Published 1 Apr 2024 in cs.LG, cs.AI, cs.CR, and cs.MA

Abstract: Cascades are a common type of machine learning systems in which a large, remote model can be queried if a local model is not able to accurately label a user's data by itself. Serving stacks for LLMs increasingly use cascades due to their ability to preserve task performance while dramatically reducing inference costs. However, applying cascade systems in situations where the local model has access to sensitive data constitutes a significant privacy risk for users since such data could be forwarded to the remote model. In this work, we show the feasibility of applying cascade systems in such setups by equipping the local model with privacy-preserving techniques that reduce the risk of leaking private information when querying the remote model. To quantify information leakage in such setups, we introduce two privacy measures. We then propose a system that leverages the recently introduced social learning paradigm in which LLMs collaboratively learn from each other by exchanging natural language. Using this paradigm, we demonstrate on several datasets that our methods minimize the privacy loss while at the same time improving task performance compared to a non-cascade baseline.


Summary

  • The paper introduces a cascade system where a local student model queries a remote teacher model without revealing sensitive data.
  • It proposes two novel metrics—the entity leak and mapping leak—to rigorously assess privacy risks in LLM interactions.
  • Experiments across diverse tasks show that replacing entities minimizes privacy loss while maintaining high performance.

Exploring Privacy-Preserving Cascade Systems in LLMs

Introduction

LLMs have become a cornerstone in the advancement of machine learning capabilities, tackling a wide range of tasks with notable success. However, their deployment, especially in contexts handling sensitive information, is marred by privacy concerns. This paper investigates privacy-preserving cascade systems, wherein a local, less capable model (the student) queries a more powerful, remote model (the teacher) without compromising the privacy of the data involved. Employing the social learning paradigm, where models learn from each other through natural language exchanges, this work makes significant strides in minimizing privacy loss while enhancing task performance in scenarios requiring access to sensitive data.
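
As a rough illustration of the cascade control flow described above, the sketch below shows a student answering locally when it is confident and otherwise sending only a privacy-transformed query to the remote teacher. The `student`/`teacher` interfaces, the confidence threshold, and the transform are assumptions made for illustration, not the paper's implementation.

```python
def cascade_answer(example, student, teacher, privacy_transform,
                   confidence_threshold=0.8):
    """Illustrative cascade: answer locally if the student is confident,
    otherwise query the remote teacher with a privacy-preserving
    transformation of the example (all interfaces are hypothetical)."""
    answer, confidence = student.predict_with_confidence(example)
    if confidence >= confidence_threshold:
        return answer  # nothing leaves the local device

    # Transform the example (e.g. abstract description, resampled
    # example, or entity masking) before it reaches the remote teacher.
    safe_query = privacy_transform(example)
    teacher_hint = teacher.query(safe_query)

    # The student uses the teacher's response as extra context and
    # answers the original, private example locally.
    return student.predict(example, context=teacher_hint)
```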

Privacy Measures

The paper introduces two novel privacy measures to assess the effectiveness of its proposed cascade system. Firstly, the "entity leak metric" quantifies the extent to which sensitive entities, such as personal names or numbers, remain within the query sent from the student to the teacher. Secondly, the "mapping leak metric" evaluates the potential for a malicious teacher to reconstruct private information despite entity masking, using auxiliary information. These measures address the nuanced and multi-faceted nature of privacy risks in cascade systems, providing a comprehensive evaluation framework.
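
To make the entity leak metric concrete, a minimal sketch is shown below: it measures the fraction of sensitive entities in the original example that still appear verbatim in the query sent to the teacher. The regex-based entity detector is a toy stand-in for a real NER system and is an assumption for illustration, not the paper's exact definition.

```python
import re

def extract_entities(text):
    """Toy entity extractor: capitalized tokens and numbers stand in
    for a proper NER system (an assumption for illustration)."""
    return set(re.findall(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", text))

def entity_leak(original_example, teacher_query):
    """Fraction of sensitive entities in the original example that
    also appear in the query sent to the remote teacher."""
    entities = extract_entities(original_example)
    if not entities:
        return 0.0
    leaked = {e for e in entities if e in teacher_query}
    return len(leaked) / len(entities)

# Example: rewriting the query removes both sensitive entities.
original = "Alice paid 250 dollars for her prescription."
query = "A person paid some amount for a medical item. What category is this?"
print(entity_leak(original, query))  # 0.0
```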

Proposed Methods

The paper details three methods designed to facilitate private communication between the student and teacher models:

  1. Creating a problem description: The student generates an abstract description of its task, aiming to elicit helpful input from the teacher without revealing sensitive details.
  2. Generating new unlabeled examples: This method involves the student synthesizing similar but novel tasks, based on the original, which are then labeled by the teacher.
  3. Replacing entities in original examples: The student modifies the original task by obfuscating or replacing entities likely to contain sensitive information (see the sketch immediately after this list).
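
A minimal sketch of the entity-replacement idea in method 3, assuming a toy regex-based entity detector and a simple placeholder scheme; the mapping stays on the local device, so the teacher only ever sees placeholders.

```python
import re

def mask_entities(text):
    """Replace likely-sensitive spans (numbers and capitalized names,
    a toy stand-in for real NER) with placeholders, keeping the
    mapping locally so it is never sent to the teacher."""
    mapping = {}
    counter = 0

    def repl(match):
        nonlocal counter
        placeholder = f"ENTITY_{counter}"
        counter += 1
        mapping[placeholder] = match.group(0)
        return placeholder

    masked = re.sub(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", repl, text)
    return masked, mapping

def unmask(text, mapping):
    """Restore the original entities locally, e.g. in a teacher
    response that refers to the placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

masked, mapping = mask_entities("Bob gives Carol 5 of his 12 apples.")
# masked  -> "ENTITY_0 gives ENTITY_1 ENTITY_2 of his ENTITY_3 apples."
# mapping -> {'ENTITY_0': 'Bob', 'ENTITY_1': 'Carol', 'ENTITY_2': '5', 'ENTITY_3': '12'}
```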

Additionally, it explores the use of grouping unlabeled examples to optimize the balance between information disclosure and the utility of teacher responses.
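
One way such grouping can look in practice is to batch several generated or masked examples into a single teacher query and split the response into per-example labels. The prompt and response formats below, like the `teacher.query` interface, are assumptions for illustration rather than the paper's protocol.

```python
def grouped_teacher_query(teacher, unlabeled_examples, task_instruction):
    """Label several unlabeled examples with one teacher call
    (hypothetical `teacher.query` interface and toy formatting)."""
    prompt_lines = [task_instruction, ""]
    for i, example in enumerate(unlabeled_examples, start=1):
        prompt_lines.append(f"{i}. {example}")
    prompt_lines += ["", "Answer each item on its own numbered line."]

    response = teacher.query("\n".join(prompt_lines))

    # Naive parsing: one numbered answer per line (an assumption about
    # how the teacher formats its reply).
    labels = []
    for line in response.splitlines():
        head, sep, tail = line.partition(".")
        if sep and head.strip().isdigit():
            labels.append(tail.strip())
    return list(zip(unlabeled_examples, labels))
```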

Experiments

The conducted experiments highlight the effectiveness of the proposed methods across various datasets, including GSM8k, Intent Recognition, Subj, and machine translation tasks. Notably, Method 3 ("Replacing entities") consistently delivers strong performance while ensuring minimal privacy loss, as per the entity leak metric. When considering the potential for information reconstruction using auxiliary data, generating new examples with grouping (Method 2) presents a robust approach to preserving privacy.

Implications and Future Directions

The research underscores the feasibility of implementing privacy-preserving cascade systems within LLMs, marking a pivotal step towards their ethical and safe deployment in privacy-sensitive applications. It opens new avenues for enhancing the privacy measures introduced, exploring complex interactions between student and teacher models, and extending the framework to other data modalities beyond text.

The work is a substantive contribution to the ongoing conversation on privacy in AI, offering concrete methodologies for building privacy-aware machine learning systems. As LLMs continue to permeate technology and society, research of this kind only grows in importance. Future work might examine the dynamics of social learning in more depth, alternative privacy-preserving techniques, and the scalability of the proposed methods to a broader spectrum of LLM applications.
