Differential Privacy for Text Analytics via Natural Text Sanitization (2106.01221v1)

Published 2 Jun 2021 in cs.CL and cs.CR

Abstract: Texts convey sophisticated knowledge. However, texts also convey sensitive information. Despite the success of general-purpose LLMs and domain-specific mechanisms with differential privacy (DP), existing text sanitization mechanisms still provide low utility, as cursed by the high-dimensional text representation. The companion issue of utilizing sanitized texts for downstream analytics is also under-explored. This paper takes a direct approach to text sanitization. Our insight is to consider both sensitivity and similarity via our new local DP notion. The sanitized texts also contribute to our sanitization-aware pretraining and fine-tuning, enabling privacy-preserving natural language processing over the BERT LLM with promising utility. Surprisingly, the high utility does not boost up the success rate of inference attacks.

Citations (64)

Summary

  • The paper introduces a novel Utility-optimized Metric LDP (UMLDP) framework for differential privacy in text analytics, improving classification accuracy by 28%.
  • It presents two sanitization mechanisms, SanText and its enhanced version SanText+, that leverage semantic similarity to substitute tokens while preserving text utility.
  • The method enables privacy-preserving NLP by generating human-readable text, making it viable for applications in sentiment analysis, healthcare, and finance.

Differential Privacy for Text Analytics via Natural Text Sanitization

The paper "Differential Privacy for Text Analytics via Natural Text Sanitization" presents a robust framework for ensuring differential privacy in text analytics. It addresses the challenge of protecting sensitive information inherent in text data while maintaining utility for NLP tasks. The authors propose a novel local differential privacy (LDP) notion, Utility-optimized Metric LDP (UMLDP), and introduce sanitization mechanisms that strategically preserve semantic similarity and allocate privacy budget by focusing on sensitive words.
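For orientation, the metric-LDP guarantee that UMLDP builds on can be sketched as follows (the notation here is a common formulation of metric LDP, not necessarily the paper's exact statement): a randomized mechanism $M$ satisfies $\varepsilon$-metric LDP if, for any two tokens $x, x'$ and any output token $y$,

$$\Pr[M(x) = y] \le e^{\varepsilon \cdot d(x, x')} \cdot \Pr[M(x') = y],$$

where $d$ measures the distance between tokens, for example in word-embedding space. UMLDP relaxes this requirement for the non-sensitive part of the vocabulary, which is what frees privacy budget to be spent where it matters.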

Framework Overview

The paper first delineates the inadequacies of conventional methods in balancing data utility and privacy in high-dimensional text representations. It proposes a direct approach to text sanitization that generates human-readable, sanitized documents which integrate seamlessly with NLP pipelines, offering transparency and explainability. The authors design an NLP pipeline consisting of user-side sanitization and service-provider-side NLP modeling, with techniques for both pretraining and fine-tuning on sanitized texts.
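To make the division of labor concrete, here is a minimal, illustrative sketch of the two-stage pipeline, assuming the Hugging Face transformers library on the service-provider side. The pass-through sanitize function is a placeholder for a real token-level mechanism (see the SanText sketch in the next section); nothing here reproduces the authors' actual training setup.

```python
# Illustrative two-party pipeline: user-side sanitization, then ordinary
# fine-tuning on the service-provider side. Assumes `transformers` and
# `torch` are installed; the model and task choices are arbitrary.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def sanitize(doc: str) -> str:
    # Placeholder for a token-level DP mechanism (see the SanText sketch
    # below); a real implementation replaces tokens before the text ever
    # leaves the user's device.
    return doc

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# User side: documents are sanitized locally.
sanitized = [sanitize(d) for d in ["the flu ruined my week", "great movie"]]

# Service-provider side: the model only ever sees sanitized text.
batch = tokenizer(sanitized, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=torch.tensor([1, 0])).loss
loss.backward()  # one standard fine-tuning step on sanitized inputs
```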

Technical Contributions

The paper introduces two text sanitization mechanisms: SanText and its enhanced version, SanText+. Both leverage a variant of the exponential mechanism that samples a replacement token based on its semantic similarity to the original, quantified using a pre-trained word embedding model (a minimal sketch of the sampling step follows the list below).

  • SanText: Samples token substitutions according to semantic similarity under a differential privacy guarantee, operating directly on tokens rather than a high-dimensional text representation and thereby sidestepping the curse of dimensionality.
  • SanText+: Divides the vocabulary into sensitive and non-sensitive zones, concentrating the noise (and the privacy budget) on the sensitive zone while maintaining utility where privacy is less critical.
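The following is a minimal sketch of that sampling step, with SanText+'s sensitive/non-sensitive split included in simplified form. The toy vocabulary, random embeddings, SENSITIVE set, and epsilon value are all illustrative assumptions; the authors use pre-trained word vectors and task-specific privacy parameters.

```python
# Exponential-mechanism token sampling in the spirit of SanText, with a
# simplified SanText+-style sensitive zone. All constants are toy values.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["flu", "cold", "cancer", "movie", "film", "great"]
EMB = rng.normal(size=(len(VOCAB), 50))           # stand-in for real word vectors
EMB /= np.linalg.norm(EMB, axis=1, keepdims=True)
SENSITIVE = {"flu", "cold", "cancer"}             # SanText+ sensitive zone
IDX = {w: i for i, w in enumerate(VOCAB)}

def sanitize_token(word: str, epsilon: float = 2.0) -> str:
    """Replace `word` with a token sampled with probability proportional to
    exp(epsilon * similarity / 2): the exponential mechanism over embedding
    similarity."""
    if word not in IDX:
        return word
    if word not in SENSITIVE:
        # Simplified: keep non-sensitive tokens as-is. The full SanText+
        # also substitutes non-sensitive tokens with some probability.
        return word
    sims = EMB @ EMB[IDX[word]]                   # cosine similarity to each token
    probs = np.exp(0.5 * epsilon * sims)
    probs /= probs.sum()
    return VOCAB[rng.choice(len(VOCAB), p=probs)]

print([sanitize_token(w) for w in "the flu made the movie less great".split()])
```

Because the output is itself natural text over the same vocabulary, sanitized documents remain human-readable and can flow through standard NLP tooling unchanged.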

Under UMLDP, SanText+ achieves 28% accuracy improvements over counterparts in text classification tasks. It outperforms existing solutions by applying local differential privacy at the token level, marking a clear step forward in maintaining semantic relevance while ensuring privacy.

Practical Implications

The sanitized texts enable robust training of models, such as BERT, within a privacy-preserving environment. The authors assert their approach does not compromise privacy through inference attacks, contrasting with some models where higher accuracy inadvertently facilitates leakage of private data.

Their results across NLP tasks such as sentiment analysis, medical semantic similarity, and question answering indicate considerable utility improvements even under stringent privacy settings. Combined with efficient computation times, this positions their approach as a viable solution for widespread practical application.

Theoretical and Future Directions

UMLDP broadens the scope of differential privacy by effectively integrating utility considerations. This conceptual contribution potentially influences future advancements in privacy notions and encourages investigations into more sophisticated definitions of sensitivity levels in text data.

The paper speculates on the long-term integration of privacy-preserving techniques into AI, anticipating collaboration across domains to refine sensitivity delineation and user-specific customization. The effectiveness of differentially private machine learning, as demonstrated, suggests promising avenues for further exploration and application within sensitive domains like healthcare and finance.

Conclusion

This paper significantly advances differentially private NLP, pioneering a natural, explainable approach to text sanitization that harmonizes utility with privacy. The framework facilitates privacy self-assessment by stakeholders and exhibits potential for broader interdisciplinary research, aiming to make privacy-preserving NLP scalable and widely applicable.
