A Customized Text Sanitization Mechanism with Differential Privacy (2207.01193v2)
Abstract: As privacy issues receive increasing attention within the NLP community, numerous methods have been proposed to sanitize texts under differential privacy. However, state-of-the-art text sanitization mechanisms based on metric local differential privacy (MLDP) do not apply to non-metric semantic similarity measures and cannot achieve good trade-offs between privacy and utility. To address these limitations, we propose a novel Customized Text (CusText) sanitization mechanism based on the original $\epsilon$-differential privacy (DP) definition, which is compatible with any similarity measure. Furthermore, CusText assigns each input token a customized output set of tokens to provide more advanced privacy protection at the token level. Extensive experiments on several benchmark datasets show that CusText achieves a better trade-off between privacy and utility than existing mechanisms. The code is available at https://github.com/sai4july/CusText.
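The abstract does not spell out how a replacement token is drawn from a token's customized output set. A minimal sketch of one standard way to do this under $\epsilon$-DP is the exponential mechanism, shown below. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sampling_probabilities`, `sample_token`), the min-max normalization of scores to [0, 1], and the caller-supplied `similarity` function are all hypothetical choices made for this example.

```python
import math
import random


def sampling_probabilities(token, output_set, similarity, epsilon):
    """Exponential-mechanism probabilities over a token's output set.

    Assumption: scores are min-max normalized to [0, 1], so the score
    sensitivity is at most 1 and weights are exp(epsilon * score / 2).
    `similarity` may be any real-valued function; it need not be a metric.
    """
    raw = [similarity(token, cand) for cand in output_set]
    lo, hi = min(raw), max(raw)
    span = hi - lo
    # If all candidates score the same, treat them as equally good.
    scores = [(r - lo) / span if span > 0 else 1.0 for r in raw]
    weights = [math.exp(epsilon * s / 2.0) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(token, output_set, similarity, epsilon, rng=random):
    """Draw one replacement token for `token` from its output set."""
    probs = sampling_probabilities(token, output_set, similarity, epsilon)
    return rng.choices(output_set, weights=probs, k=1)[0]
```

With this construction, any two input tokens that share the same output set yield output probabilities that differ by a factor of at most $e^{\epsilon}$, which is the $\epsilon$-DP condition restricted to that set. A smaller `epsilon` flattens the distribution (more privacy, less utility), while a larger one concentrates mass on the most similar candidates.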