Differentially Private Text Sanitization

Updated 2 September 2025
  • Differentially private text sanitization is a set of algorithms and frameworks that redact or generalize sensitive text using randomized noise and formal privacy guarantees.
  • It employs methods such as metric local differential privacy, exponential mechanism sampling, and deep generative models to maintain a balance between data utility and privacy protection.
  • Practical implementations use embedding noise injection, context-aware techniques, and heuristic algorithms to optimize trade-offs in real-time and large-scale applications.

Differentially private text sanitization comprises a collection of algorithms and theoretical frameworks designed to transform sensitive text so that privacy guarantees, formalized by differential privacy (DP), are satisfied—even when the data is released, shared, or used in downstream analytics. This topic has emerged as central to privacy-preserving NLP, where the challenge is to redact, generalize, or synthesize text so that re-identification risk is strictly limited, utility is preserved for legitimate applications, and adversarial reconstruction attacks are minimized.

1. Theoretical Models and Mechanisms

Among the foundational approaches is the metric local differential privacy (MLDP) model, which applies randomized selection or noise to individual tokens (often via the exponential mechanism), using text similarity measures to guide substitutions (Chen et al., 2022, Yue et al., 2021). Given a privacy budget ε and a (metric or non-metric) similarity function d, the DP mechanism M selects an output y for each input token x such that

$$\frac{\Pr[M(x) = y]}{\Pr[M(x') = y]} \leq e^{\epsilon \cdot d(x, x')}$$

for all possible tokens x, x′ and outputs y.
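
As a toy illustration (not the implementation of any cited mechanism; the vocabulary, distance values, and function names below are invented), the following Python sketch samples replacement tokens with probability proportional to exp(−ε·d(x, y)/2) and numerically checks the metric-LDP ratio bound for one pair of inputs, assuming d satisfies the triangle inequality:

```python
import math
import random

# Toy vocabulary with a hand-crafted symmetric, triangle-inequality-respecting
# distance d; in practice d would come from embedding or other similarity measures.
VOCAB = ["doctor", "physician", "nurse", "teacher"]
DIST = {
    ("doctor", "doctor"): 0.0, ("doctor", "physician"): 0.2,
    ("doctor", "nurse"): 0.5, ("doctor", "teacher"): 1.0,
    ("physician", "physician"): 0.0, ("physician", "nurse"): 0.5,
    ("physician", "teacher"): 1.0, ("nurse", "nurse"): 0.0,
    ("nurse", "teacher"): 0.9, ("teacher", "teacher"): 0.0,
}

def d(x, y):
    return DIST.get((x, y), DIST.get((y, x)))

def output_distribution(x, epsilon):
    """Pr[M(x) = y] proportional to exp(-epsilon * d(x, y) / 2)."""
    weights = {y: math.exp(-epsilon * d(x, y) / 2.0) for y in VOCAB}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

def sanitize_token(x, epsilon):
    probs = output_distribution(x, epsilon)
    return random.choices(list(probs), weights=probs.values())[0]

if __name__ == "__main__":
    eps = 4.0
    p_doc = output_distribution("doctor", eps)
    p_nur = output_distribution("nurse", eps)
    # Check Pr[M(x)=y] / Pr[M(x')=y] <= exp(eps * d(x, x')) for every output y.
    bound = math.exp(eps * d("doctor", "nurse"))
    worst = max(p_doc[y] / p_nur[y] for y in VOCAB)
    print(f"worst-case ratio {worst:.3f} <= bound {bound:.3f}: {worst <= bound + 1e-9}")
    print("sanitized:", sanitize_token("doctor", eps))
```

Larger ε concentrates the output distribution on tokens close to x, recovering more utility at the cost of a weaker guarantee.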

In extensions like Utility-optimized Metric-LDP (UMLDP), semantic sensitivity is explicitly modeled: the output distribution for a “sensitive” token is made less deterministic than for a “non-sensitive” token, striking a privacy-utility balance gauged by parameters ε, ε₀, and others (Yue et al., 2021). Mechanisms such as CusText (Chen et al., 2022) enhance MLDP by mapping each token to a customized output set—improving semantic preservation and supporting non-metric similarity measures.

For document-level semantic disclosure, (C, g(C))-sanitization quantifies the risk via information-theoretic concepts, using information content (IC) and pointwise mutual information (PMI), and setting thresholds so that no term or group of terms reveals more information about a sensitive entity c than is allowed by its designated generalization g(c):

$$\text{PMI}(c; t) \leq \text{IC}(g(c))$$

This model improves flexibility and interpretable privacy control over earlier C-sanitization approaches (Sánchez et al., 2017).
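
A minimal sketch of this disclosure test, assuming document-level occurrence and co-occurrence counts from a reference corpus (all counts, terms, and helper names below are invented for illustration, not from the cited work): a term t is flagged for generalization whenever PMI(c; t) exceeds IC(g(c)).

```python
import math

# Hypothetical corpus statistics over N documents; real systems estimate these
# from large reference corpora (e.g., web-scale counts).
N = 1_000_000
doc_count = {"AIDS": 800, "infection": 40_000, "zidovudine": 500, "illness": 200_000}
co_count = {("AIDS", "infection"): 120, ("AIDS", "zidovudine"): 450}

def prob(term):
    return doc_count[term] / N

def joint_prob(c, t):
    return co_count.get((c, t), co_count.get((t, c), 0)) / N

def ic(term):
    """Information content IC(term) = -log2 Pr(term)."""
    return -math.log2(prob(term))

def pmi(c, t):
    """Pointwise mutual information between sensitive entity c and term t."""
    p_ct = joint_prob(c, t)
    return -math.inf if p_ct == 0 else math.log2(p_ct / (prob(c) * prob(t)))

def must_sanitize(c, t, generalization):
    """Term t discloses too much about c if PMI(c; t) > IC(g(c))."""
    return pmi(c, t) > ic(generalization)

if __name__ == "__main__":
    c, g_c = "AIDS", "illness"   # protect 'AIDS' at the abstraction level of 'illness'
    for t in ["infection", "zidovudine"]:
        print(t, "PMI=%.2f" % pmi(c, t), "IC(g)=%.2f" % ic(g_c),
              "-> generalize" if must_sanitize(c, t, g_c) else "-> keep")
```

In this toy setting the highly specific term is flagged while the generic one is kept, mirroring the intended behavior of the threshold.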

Deep generative models, such as dp-GAN, offer an alternative: the generator is trained via DP (typically by injecting Gaussian noise into gradient updates), then used to synthesize an unlimited set of privacy-safeguarded texts (Zhang et al., 2018). The DP guarantee holds at the model level, and utility is evaluated by downstream analytic tasks.
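
The gradient-level noise injection can be sketched as per-example clipping followed by Gaussian noise, in the spirit of DP-SGD (a simplified NumPy illustration, not the dp-GAN reference implementation; hyperparameters and function names are assumptions, and a real system would track the privacy budget with a moments accountant):

```python
import numpy as np

def dp_gradient_step(params, per_example_grads, lr=0.01, clip_norm=1.0,
                     noise_multiplier=1.1, rng=None):
    """One DP-SGD-style update: clip each example's gradient, average,
    then add Gaussian noise calibrated to the clipping norm."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]                      # per-example clipping
    mean_grad = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    noisy_grad = mean_grad + rng.normal(0.0, sigma, size=mean_grad.shape)
    return params - lr * noisy_grad                             # descent step

if __name__ == "__main__":
    params = np.zeros(4)
    batch = [np.array([0.5, -1.2, 3.0, 0.1]), np.array([2.0, 0.3, -0.5, 1.0])]
    for _ in range(3):
        params = dp_gradient_step(params, batch)
    print(params)
```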

2. Practical Implementation and Algorithmic Details

Implementation strategies for token sanitization typically involve several components:

  • Embedding and Noise Injection: Each token x is mapped to an embedding vector ϕ(x); noise (e.g., sampled from a multivariate Laplace or Gamma distribution) is added prior to mapping back to a discrete token using nearest neighbor search (Carpentier et al., 18 Nov 2024, Arnold et al., 2023); a sketch of this step appears after this list.
  • Exponential Mechanism Sampling: For MLDP and CusText, output tokens are sampled from customized sets according to the exponential mechanism, using a scoring function u(x, y) based on similarity; a sketch of this sampler also follows the list. For instance, in CusText, u(x,y) = −d′(x,y) is normalized, and the output distribution is

$$\Pr[\text{sample}(x) = y] = \frac{\exp\left(\frac{\epsilon\, u(x,y)}{2\Delta u}\right)}{\sum_{y' \in \mathcal{Y}'} \exp\left(\frac{\epsilon\, u(x,y')}{2\Delta u}\right)}$$

where $\mathcal{Y}'$ is the customized output set for x.

  • Context-aware Mechanisms: To address ambiguity, sense embeddings and word sense disambiguation steps are used, determining the appropriate meaning of each token in its context before sanitization (Arnold et al., 2023). This approach improves downstream classification accuracy—such as a 6.05% boost on WiC word sense tasks.
  • Greedy Heuristic Algorithms: For semantic disclosure (as in (C, g(C))-sanitization), efficient greedy algorithms operate on document contexts, sorting terms by information content, evaluating them against PMI thresholds, and replacing risky terms with the most utility-preserving generalizations. Empirical evaluation is critical because computing optimal sanitizations is NP-hard (Sánchez et al., 2017).
  • Streaming Sanitization: For large or real-time data, the stream is partitioned into blocks; a DP sanitizer runs offline on each block, and the outputs are stitched together. Statistical utility guarantees are derived via error bounds (e.g., $|\mathbb{E}_{x\in S}[c(x)] - \mathbb{E}_{x\in S'}[c(x)]| \leq \alpha$), with overall $(\epsilon, \delta)$-DP maintained by careful composition (Kaplan et al., 2021).
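
The embedding-perturbation step referenced above can be sketched as follows (a toy NumPy example with an invented embedding table; the noise uses a uniformly random direction and a Gamma-distributed radius, one common way to realize multivariate-Laplace-style noise, and mapping back uses exact nearest-neighbor search):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy embedding table; real mechanisms use pretrained word embeddings.
EMB = {
    "doctor":    np.array([0.9, 0.1, 0.0]),
    "physician": np.array([0.85, 0.15, 0.05]),
    "nurse":     np.array([0.6, 0.4, 0.1]),
    "teacher":   np.array([0.1, 0.9, 0.2]),
}

def laplace_like_noise(dim, epsilon):
    """Noise with density proportional to exp(-epsilon * ||z||):
    uniform direction on the sphere, Gamma(dim, 1/epsilon) radius."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return radius * direction

def sanitize(token, epsilon):
    """Perturb the token's embedding, then snap to the nearest vocabulary token."""
    noisy = EMB[token] + laplace_like_noise(len(EMB[token]), epsilon)
    # Exact nearest-neighbor search (ENN); large vocabularies often use
    # approximate search (ANN), which changes the privacy/utility profile.
    return min(EMB, key=lambda w: np.linalg.norm(EMB[w] - noisy))

if __name__ == "__main__":
    for eps in (1.0, 5.0, 20.0):
        print(f"epsilon={eps}:", [sanitize("doctor", eps) for _ in range(10)])
```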
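
Likewise, a minimal sketch of exponential-mechanism sampling over a customized output set, following the formula above (the output sets, distance function, and sensitivity handling are illustrative assumptions, not the CusText reference implementation):

```python
import math
import random

# Hypothetical customized output sets Y'(x) and a stand-in distance d'.
OUTPUT_SETS = {
    "doctor": ["doctor", "physician", "surgeon", "nurse"],
    "42":     ["41", "42", "43", "44"],
}

def d_prime(x, y):
    # Illustrative distance; real mechanisms use embedding or numeric distances
    # restricted to the customized output set.
    if x == y:
        return 0.0
    return abs(int(x) - int(y)) / 4.0 if x.isdigit() else 0.5

def custext_sample(x, epsilon):
    """Sample y from Y'(x) with Pr proportional to exp(eps * u(x,y) / (2 * delta_u)),
    where u(x,y) = -d'(x,y) and delta_u bounds the score range over the set."""
    candidates = OUTPUT_SETS[x]
    scores = {y: -d_prime(x, y) for y in candidates}
    delta_u = (max(scores.values()) - min(scores.values())) or 1.0
    weights = [math.exp(epsilon * scores[y] / (2.0 * delta_u)) for y in candidates]
    return random.choices(candidates, weights=weights)[0]

if __name__ == "__main__":
    print([custext_sample("42", epsilon=2.0) for _ in range(8)])
    print([custext_sample("doctor", epsilon=8.0) for _ in range(8)])
```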

3. Utility, Privacy Protection, and Evaluation

A central theme is balancing privacy with utility. Empirical metrics include:

  • Downstream Task Performance: Utility is measured by accuracy on classification (SST-2, QNLI), semantic similarity (MedSTS, BERTopic cosine similarity), topic modeling, and summarization tasks (Chen et al., 2022, Yue et al., 2021, Carpentier et al., 18 Nov 2024).
  • Privacy Protection: Recall, precision, and F1-score quantify correct redaction of sensitive terms. In generative approaches, empirical privacy is validated by resistance to inference attacks or successful masking of PII (Xin et al., 28 Apr 2025, Pang et al., 16 Oct 2024).
  • Information Content and PMI: Quantifies semantic risk and guides threshold selection (Sánchez et al., 2017).
  • Attack Success Rate (ASR): The fraction of sensitive tokens reconstructed by adversarial attacks, particularly attacks that exploit surrounding context or shadow datasets, is a key vulnerability indicator. Recent work derives tight Bayesian bounds on ASR in both context-free and contextual regimes:

$$\hat{x} = \arg\max_{x' \in X} \frac{\Pr(y \mid x') \cdot \Pr(x')}{\Pr(y)}$$

and

$$\hat{x} = \arg\max_{x' \in X} \frac{\Pr(y \mid x') \cdot \Pr(x') \cdot \Pr(c \mid x', y)}{\Pr(y) \cdot \Pr(c \mid y)}$$

where c represents the sanitized context and $\hat{x}$ the adversary's reconstruction of the original token (Tong et al., 22 Oct 2024).
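
In the context-free case, the Bayes-optimal guess can be computed directly once the adversary knows (or estimates) the mechanism's output distribution and a prior over candidate tokens; the sketch below uses invented toy distributions to show the guess rule and the resulting attack success rate:

```python
# Adversary's model: the mechanism's conditional distribution Pr(y | x) and a
# prior Pr(x) over plausible original tokens (values are toy assumptions).
PR_Y_GIVEN_X = {
    "doctor":    {"doctor": 0.46, "physician": 0.31, "nurse": 0.17, "teacher": 0.06},
    "physician": {"doctor": 0.31, "physician": 0.46, "nurse": 0.17, "teacher": 0.06},
    "nurse":     {"doctor": 0.19, "physician": 0.19, "nurse": 0.53, "teacher": 0.09},
    "teacher":   {"doctor": 0.07, "physician": 0.07, "nurse": 0.10, "teacher": 0.76},
}
PRIOR = {"doctor": 0.1, "physician": 0.1, "nurse": 0.3, "teacher": 0.5}

def bayes_optimal_guess(y):
    """argmax_x Pr(y | x) * Pr(x); the Pr(y) denominator is constant in x."""
    return max(PRIOR, key=lambda x: PR_Y_GIVEN_X[x][y] * PRIOR[x])

def attack_success_rate():
    """Expected probability that the Bayes-optimal guess recovers the true token."""
    asr = 0.0
    for x, p_x in PRIOR.items():
        for y, p_y_given_x in PR_Y_GIVEN_X[x].items():
            if bayes_optimal_guess(y) == x:
                asr += p_x * p_y_given_x
    return asr

if __name__ == "__main__":
    print("observed 'physician' -> guess:", bayes_optimal_guess("physician"))
    print("attack success rate: %.3f" % attack_success_rate())
```

Note that a skewed prior can pull the optimal guess away from the observed token, which is why shadow datasets that sharpen the adversary's prior raise ASR.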

4. Adversarial Reconstruction and Vulnerabilities

Recent studies highlight contextual vulnerability: word-level DP sanitization (randomizing tokens independently) leaves semantic and structural clues exploitable by adversaries, especially LLMs. Both black-box and white-box attacks demonstrate high recovery rates of the original sensitive content—up to 94% under practical privacy budgets ε for leading LLMs (Pang et al., 16 Oct 2024).

Practical attacks are implemented both via prompt engineering (instruction-based, black-box) and via direct fine-tuning on auxiliary data (white-box), with candidate linking performed through LLM inference and dense retrieval. Context-aware attacks (integrating the surrounding sanitized context) further enhance reconstruction success (Tong et al., 22 Oct 2024, Meisenbacher et al., 26 Aug 2025).

These vulnerabilities persist despite formal DP guarantees, motivating ongoing reassessment of theoretical and practical effectiveness.

5. Mitigation Strategies and Future Directions

Several mitigation and improvement strategies are discussed in the literature:

  • Semantic-level Protection: Pure identifier removal and surface-level scrubbing leave nuanced information exposed; frameworks for atomic claim decomposition, linking, and semantic scoring provide finer granularity in privacy evaluation (Xin et al., 28 Apr 2025).
  • Post-processing with LLMs: Controlled LLM reconstruction as a post-processing step, justified by the post-processing closure property of DP, can enhance both utility and equitable privacy by disguising text structure and reducing telltale fixed-length or rigid mapping artifacts (Meisenbacher et al., 26 Aug 2025).
  • Utility Prediction Middleware: Lightweight, local models (SLMs) can forecast the utility of sanitized prompts before submission to costly online LLMs, minimizing resource expenditure and avoiding waste when utility losses are prohibitive (Carpentier et al., 18 Nov 2024).
  • Context-aware Sanitization: Incorporating context via sense embeddings and disambiguation before noise injection addresses the semantic drift problem and improves task performance on ambiguous terms (Arnold et al., 2023).
  • Robust Mechanism Engineering: The correct implementation of the underlying DP mechanisms (e.g., proper support and inversion in Laplace reparametrization, careful handling of nearest neighbor search variants) is critical—mismatches can nullify privacy guarantees (Habernal, 2022, Carpentier et al., 18 Nov 2024).
  • Comprehensive Benchmarking: Empirical attack success rates, privacy-utility trade-off metrics, and semantic distance scores must be used in tandem to evaluate mechanisms with respect to current and evolving adversarial capabilities (Xin et al., 28 Apr 2025, Tong et al., 22 Oct 2024).

6. Challenges and Open Problems

The field faces significant open problems:

  • Trade-off Optimization: Lower privacy budgets (small ε) reduce empirical risk at the expense of data utility; concerted effort is needed to optimize mechanisms for deployment-worthy utility under robust privacy constraints (Chen et al., 2022, Xin et al., 28 Apr 2025).
  • Domain Adaptivity: Uniform privacy allocation across tokens is rarely optimal; future directions include adaptive privacy parameters tied to token or context sensitivity (Chen et al., 2022, Yue et al., 2021).
  • Streaming and Scalability: High-dimensionality, vocabulary size, and the need for semantic consistency make streaming and scalable sanitization nontrivial (Kaplan et al., 2021).
  • Model Memorization: LLM memorization and data contamination are emergent threats—mechanisms for unlearning or end-to-end privacy protection (not just at the preprocessing stage) need further research (Pang et al., 16 Oct 2024, Meisenbacher et al., 26 Aug 2025).
  • Reference Implementations: Choices between approximate (ANN) and exact (ENN) nearest-neighbor search can dramatically affect protection, highlighting the need for standardized, auditable implementations (Carpentier et al., 18 Nov 2024).

7. Significance for Research and Practice

Differentially private text sanitization represents a rigorous interface between privacy theory and natural language engineering. It is foundational for compliance (e.g., GDPR, CCPA), enabling sharing and analytics on sensitive text data without endangering individuals’ privacy. However, continuing advances in adversarial modeling, reconstruction attacks, and LLM capabilities demand ongoing innovation, benchmarking, and evolution of best practices in both mechanisms and evaluation frameworks.

The literature underscores the necessity of nuanced, context-aware, and semantically grounded approaches—moving beyond surface-level identifier removal—while maintaining utility and allowing transparent validation for both organizational and regulatory scrutiny. Theoretical models, robust optimization, careful implementation, and adversarial assessment together define the state of the art and the future trajectory of differentially private text sanitization.