
A Customized Text Sanitization Mechanism with Differential Privacy (2207.01193v2)

Published 4 Jul 2022 in cs.CR and cs.CL

Abstract: As privacy issues are receiving increasing attention within the NLP community, numerous methods have been proposed to sanitize texts subject to differential privacy. However, the state-of-the-art text sanitization mechanisms based on metric local differential privacy (MLDP) do not apply to non-metric semantic similarity measures and cannot achieve good trade-offs between privacy and utility. To address the above limitations, we propose a novel Customized Text (CusText) sanitization mechanism based on the original $\epsilon$-differential privacy (DP) definition, which is compatible with any similarity measure. Furthermore, CusText assigns each input token a customized output set of tokens to provide more advanced privacy protection at the token level. Extensive experiments on several benchmark datasets show that CusText achieves a better trade-off between privacy and utility than existing mechanisms. The code is available at https://github.com/sai4july/CusText.
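
As a rough illustration of the idea in the abstract, the sketch below replaces each token with one sampled from a fixed output set, using the exponential mechanism. That mechanism is one standard way to satisfy $\epsilon$-DP with any bounded scoring function. The similarity table, the output-set mapping, and all function names are hypothetical. The abstract does not state the sampling rule, so this is not the authors' implementation (the linked repository has that).

```python
import math
import random

# Hypothetical toy similarity scores, normalized to [0, 1]. The table is
# deliberately asymmetric to show that the score need not be a metric.
TOY_SIMILARITY = {
    ("doctor", "doctor"): 1.0, ("doctor", "nurse"): 0.8, ("doctor", "teacher"): 0.3,
    ("nurse", "doctor"): 0.7, ("nurse", "nurse"): 1.0, ("nurse", "teacher"): 0.4,
    ("teacher", "doctor"): 0.2, ("teacher", "nurse"): 0.5, ("teacher", "teacher"): 1.0,
}

# Hypothetical mapping from each input token to its output set. All tokens
# mapped to the same set are the ones the guarantee makes indistinguishable.
_GROUP = ["doctor", "nurse", "teacher"]
OUTPUT_SETS = {tok: _GROUP for tok in _GROUP}


def sampling_probabilities(token, epsilon):
    """Return the exponential-mechanism distribution over the token's output set.

    Scores lie in [0, 1], so the sensitivity is 1 and each candidate gets
    weight exp(epsilon * score / 2).
    """
    candidates = OUTPUT_SETS[token]  # raises KeyError for unknown tokens
    weights = [math.exp(epsilon * TOY_SIMILARITY[(token, c)] / 2.0) for c in candidates]
    total = sum(weights)
    return {c: w / total for c, w in zip(candidates, weights)}


def sanitize(tokens, epsilon, rng=None):
    """Replace every token independently with a sample from its output set."""
    rng = rng or random.Random()
    sanitized = []
    for tok in tokens:
        probs = sampling_probabilities(tok, epsilon)
        candidates = list(probs)
        sanitized.append(rng.choices(candidates, weights=[probs[c] for c in candidates], k=1)[0])
    return sanitized


if __name__ == "__main__":
    print(sanitize(["doctor", "nurse", "teacher"], epsilon=1.0, rng=random.Random(0)))
```

In this sketch, a smaller `epsilon` flattens the distribution (more privacy, less utility), while a larger one concentrates probability on the candidates most similar to the input.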
