Protecting User Privacy in Remote Conversational Systems: A Privacy-Preserving Framework Based on Text Sanitization (2306.08223v1)
Abstract: Large language models (LLMs) are gaining increasing attention due to their exceptional performance across numerous tasks. As a result, the general public uses them as an influential tool for boosting productivity, while natural language processing researchers endeavor to employ them in solving existing or new research problems. Unfortunately, individuals can access such powerful models only through APIs, which entails transmitting raw data to the models' providers and increases the risk of private data leakage. Current privacy-preserving methods for cloud-deployed LLMs focus on protecting private information in the pre-training dataset or during the model training phase; they do not address the specific challenges posed by remote access to new large-scale LLMs. This paper introduces a novel task, "User Privacy Protection for Dialogue Models," which aims to safeguard sensitive user information from any possible disclosure while conversing with chatbots. We also present an evaluation scheme for this task, covering metrics for privacy protection, data availability, and resistance to simulation attacks. Moreover, we propose the first framework for this task: privacy protection through text sanitization. Before sending the input to the remote large model, the framework filters out sensitive information through several rounds of text sanitization based on user-defined privacy types. Upon receiving the model's response, it automatically restores the redacted information so that the conversation proceeds smoothly, without visible intervention from the privacy filter. Experiments on real-world datasets demonstrate the efficacy of our privacy-preserving approach against eavesdropping by potential attackers.
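The sanitize-then-restore loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `PrivacyFilter` class, its placeholder scheme, and the example privacy types (`NAME`, `CITY`) are all hypothetical, standing in for the paper's multi-round, type-driven sanitization.

```python
class PrivacyFilter:
    """Toy sketch of client-side privacy filtering: substitute
    user-defined sensitive values with placeholders before a query
    leaves the device, then restore them in the remote model's reply."""

    def __init__(self, privacy_types):
        # privacy_types: mapping from a privacy-type label to the
        # sensitive values the user wants hidden,
        # e.g. {"NAME": ["Alice"], "CITY": ["Boston"]}
        self.mapping = {}
        for ptype, values in privacy_types.items():
            for i, value in enumerate(values):
                self.mapping[value] = f"<{ptype}_{i}>"

    def sanitize(self, text):
        # Applied before the query is sent to the remote LLM.
        for value, placeholder in self.mapping.items():
            text = text.replace(value, placeholder)
        return text

    def restore(self, text):
        # Applied to the remote LLM's response before showing it
        # to the user, so the conversation reads naturally.
        for value, placeholder in self.mapping.items():
            text = text.replace(placeholder, value)
        return text


pf = PrivacyFilter({"NAME": ["Alice"], "CITY": ["Boston"]})
query = pf.sanitize("Book Alice a flight to Boston")
# query == "Book <NAME_0> a flight to <CITY_0>"
reply = pf.restore("Done, <NAME_0>'s flight to <CITY_0> is booked.")
# reply == "Done, Alice's flight to Boston is booked."
```

A real system would replace the exact-match substitution with privacy-type detection (e.g. named-entity recognition) over several sanitization rounds, as the paper proposes; the round trip (sanitize, send, receive, restore) stays the same.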
Authors: Zhigang Kan, Linbo Qiao, Hao Yu, Liwen Peng, Yifu Gao, Dongsheng Li