
MaskSQL: Privacy-Preserving Text-to-SQL

Updated 4 October 2025
  • MaskSQL is a privacy-preserving text-to-SQL framework that uses deterministic abstraction to mask sensitive schema elements and literals while retaining essential relational structure.
  • It employs a hybrid approach by locally abstracting sensitive tokens and remotely generating SQL with LLMs, ensuring compliance with regulations like GDPR and HIPAA.
  • The framework offers a tunable privacy-utility tradeoff, outperforming traditional SLM approaches and closely matching the performance of unconstrained LLM configurations.

MaskSQL is a privacy-preserving text-to-SQL framework that uses abstraction to mask sensitive schema elements and literal values in LLM prompts while retaining the relational structure required for accurate SQL generation. It was developed in response to regulatory demands (such as GDPR and HIPAA) and to the impracticality of deploying proprietary, resource-intensive LLMs locally. MaskSQL lets organizations use the reasoning capacity of remote LLMs for SQL synthesis without exposing private data. By applying fine-grained abstraction rather than redaction or generalization, MaskSQL achieves a tunable balance between privacy and utility, outperforming other small language model (SLM) approaches and approaching the accuracy of unconstrained LLM configurations (Abedini et al., 27 Sep 2025).

1. The Privacy Dilemma in Text-to-SQL Systems

Text-to-SQL systems typically send the natural language question, the full database schema, and possibly user-level literal values to an LLM, which exposes sensitive metadata and personally identifiable information to third-party infrastructure. This is incompatible with regulatory frameworks that demand data localization and strict privacy (e.g., requiring no schema leakage). Locally deployed SLMs protect privacy but lack the reasoning and generalization ability needed to generate complex SQL accurately. MaskSQL was developed as a hybrid framework: it abstracts sensitive tokens locally before LLM inference, sends only the abstract representation across the trust boundary, and then reconstructs the final query locally using a private mapping.

2. Abstraction Mechanism

MaskSQL implements privacy through deterministic abstraction, which replaces all policy-selected sensitive tokens with abstract symbols via a bijective mapping. The framework identifies sensitive tokens, including table names ($w \in \mathbb{W}$), column names, and literal values, using a locally deployed SLM-based schema and value linker. The masking function $f(w)$ outputs an abstract symbol $S$ for any token $w$ that is detected as sensitive, with the property that the mapping between abstract symbol and concrete value can be perfectly reversed after LLM inference.
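A minimal sketch of such a reversible mapping is given below. It is not the authors' implementation: the class name `SymbolMap`, the kind prefixes `T`/`C`/`V`, and the ASCII symbol format (`T1` rather than T₁) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolMap:
    """Illustrative bijection between sensitive tokens and abstract symbols."""

    forward: dict = field(default_factory=dict)   # (kind, token) -> symbol
    backward: dict = field(default_factory=dict)  # symbol -> token
    counters: dict = field(default_factory=dict)  # kind -> last index used

    def mask(self, token: str, kind: str) -> str:
        """Return the symbol for a token, allocating one on first sight."""
        key = (kind, token)
        if key not in self.forward:
            self.counters[kind] = self.counters.get(kind, 0) + 1
            symbol = f"{kind}{self.counters[kind]}"
            self.forward[key] = symbol
            self.backward[symbol] = token
        return self.forward[key]

    def unmask(self, symbol: str) -> str:
        """Invert the mapping; raises KeyError for an unknown symbol."""
        return self.backward[symbol]


# Hypothetical tokens; the kinds are T (table), C (column), V (value).
symbols = SymbolMap()
masked = [
    symbols.mask("Patients", "T"),
    symbols.mask("hiv_status", "C"),
    symbols.mask("positive", "V"),
    symbols.mask("Patients", "T"),  # repeated token reuses its symbol
]
restored = [symbols.unmask(s) for s in masked]
```

Because the same token always receives the same symbol and every symbol maps back to exactly one token, the substitution is deterministic and can be undone locally without any information from the remote model.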

Pipeline Stages:

  • Local Abstraction: Analyze and mask the input NL query and schema according to a user-defined policy, mapping elements such as Patients → T₁, hiv_status → C₃, “positive” → V₂.
  • Remote SQL Generation: Send the abstracted NL query and schema to the remote LLM. The LLM generates an abstract SQL query that retains the relational structure (e.g., SELECT C₃ FROM T₁ WHERE C₄ = V₂).
  • Local Reconstruction: Use the stored mapping to de-abstract the SQL back into the original schema and literal values, with possible correction using the local SLM if abstraction artifacts are detected.

This abstraction, unlike redaction (removal) or generalization (category replacement), preserves relational alignment, allowing the LLM to reason about query structure without seeing actual data or schema names.
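The three stages can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the mapping is hard-coded instead of being produced by an SLM-based linker, `fake_remote_llm` is a hypothetical stand-in that returns a canned answer instead of calling a real model, and symbols starting with `V` are assumed to denote string literals that need quoting on reconstruction.

```python
import re

# Hypothetical private mapping (token -> symbol); in MaskSQL this would come
# from the local schema and value linker, not from a hard-coded dict.
MAPPING = {"Patients": "T1", "hiv_status": "C1", "name": "C2", "positive": "V1"}


def _alternation(words):
    """Build one regex that matches any of the given whole words."""
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b")


def abstract_text(text, mapping):
    """Stage 1 (local): replace sensitive tokens with symbols in one pass."""
    return _alternation(mapping).sub(lambda m: mapping[m.group(0)], text)


def fake_remote_llm(prompt):
    """Stage 2 (remote): stand-in for an LLM call; returns a canned query."""
    return "SELECT C2 FROM T1 WHERE C1 = V1"


def reconstruct_sql(abstract_sql, mapping):
    """Stage 3 (local): map symbols back, quoting value literals (assumed)."""
    inverse = {symbol: token for token, symbol in mapping.items()}

    def restore(match):
        symbol = match.group(0)
        token = inverse[symbol]
        if symbol.startswith("V"):
            return "'" + token.replace("'", "''") + "'"
        return token

    return _alternation(inverse).sub(restore, abstract_sql)


question = "List the name of Patients whose hiv_status is positive"
abstract_question = abstract_text(question, MAPPING)
abstract_sql = fake_remote_llm(abstract_question)
final_sql = reconstruct_sql(abstract_sql, MAPPING)
print(abstract_question)
print(final_sql)
```

Only `abstract_question` (and an abstracted schema) would cross the trust boundary in this sketch; `MAPPING` and `final_sql` never leave the local environment.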

3. Comparison with Redaction and Generalization

MaskSQL’s abstraction preserves more information than redaction (which erases tokens, risking syntactic and semantic misalignment) and avoids the utility loss associated with generalization (which replaces tokens with broad categories that can weaken schema linkage). With abstraction, positional and contextual mappings remain intact; if a table is referenced multiple times, the same symbol is used, enabling the LLM to synthesize joins and filters without schema exposure. This controlled substitution maintains task utility while blocking content exposure.

Protection Technique     Privacy Strength   Utility Preservation
Redaction                Highest            Weak
Generalization           Moderate           Variable
Abstraction (MaskSQL)    Tunable            Strong

This framework supports adjustment of masking granularity, offering organizations flexibility to trade accuracy for enhanced privacy or vice versa.
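The difference between the three techniques can be seen on a single toy question. The sketch below is illustrative only: the token list, the category labels, and the function names are assumptions, and real systems would detect sensitive tokens rather than look them up in a fixed dictionary.

```python
import re

# Hypothetical sensitive tokens and their broad categories.
SENSITIVE = {"Alice": "PERSON", "Carol": "PERSON", "Boston": "LOCATION"}
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, SENSITIVE)) + r")\b")


def redact(text):
    """Redaction: erase sensitive tokens, then collapse leftover spaces."""
    return re.sub(r"\s+", " ", PATTERN.sub("", text)).strip()


def generalize(text):
    """Generalization: replace each token with its broad category."""
    return PATTERN.sub(lambda m: SENSITIVE[m.group(0)], text)


def abstract(text):
    """Abstraction: replace each distinct token with its own stable symbol."""
    symbols = {}

    def repl(match):
        token = match.group(0)
        # setdefault keeps the first symbol assigned to a token.
        return symbols.setdefault(token, f"V{len(symbols) + 1}")

    return PATTERN.sub(repl, text), symbols


question = "Did Alice refer Carol before Alice moved to Boston?"
redacted = redact(question)
generalized = generalize(question)
abstracted, symbol_table = abstract(question)
print(redacted)
print(generalized)
print(abstracted)
```

In this example generalization can no longer tell Alice from Carol, and redaction loses the sentence structure, while abstraction keeps both the distinction between entities and the fact that the first and third mentions refer to the same one.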

4. Performance Evaluation

On the BIRD benchmark, MaskSQL’s execution accuracy is measured across privacy policies:

  • Full Policy: Abstracts all schema elements and literal values. Execution accuracy ≈ 55.66%, outperforming state-of-the-art SLM methods.
  • Category-Based Policy: Masks only tokens related to “names, occupations, or locations,” achieving accuracy ≈ 62.66%.

Compared to unconstrained LLM prompting (e.g., sending the unmasked prompt directly to GPT-4.1), MaskSQL incurs a small accuracy gap but keeps policy-selected tokens out of the remote prompt. Additional reported metrics include competitive token efficiency and high scores on adversarial re-identification evaluation, indicating that an adversary who sees only the masked outputs has difficulty inferring symbol-to-value links.
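Execution accuracy is the fraction of examples for which the predicted query returns the same result as the reference query when both are run against the database. The sketch below shows one simplified way to compute it with SQLite; the toy table, the query pairs, and the set-based comparison are assumptions, and the benchmark's official evaluator may differ in details.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries run and return the same set of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return set(predicted) == set(gold)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Toy database and query pairs, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (name TEXT, status TEXT);
    INSERT INTO patients VALUES
        ('a', 'positive'), ('b', 'negative'), ('c', 'positive');
    """
)
pairs = [
    # Same rows, different ordering: counted as a match.
    ("SELECT name FROM patients WHERE status = 'positive'",
     "SELECT name FROM patients WHERE status = 'positive' ORDER BY name"),
    # Different rows: counted as a miss.
    ("SELECT name FROM patients",
     "SELECT name FROM patients WHERE status = 'negative'"),
    # Misspelled column: fails to execute, counted as a miss.
    ("SELECT nme FROM patients", "SELECT name FROM patients"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```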

5. Privacy-Utility Tradeoff and Policy Control

A notable property of MaskSQL is tunable control over the amount and type of abstraction via a policy engine. Full abstraction maximizes privacy but may reduce utility because name-specific information is lost, while selective abstraction (category-based or column-level) maintains higher execution accuracy. Organizations can customize the policy by deciding which schema elements and literals require masking, as dictated by regulation, contractual requirements, or internal policy, so that what is exposed to the remote LLM is an explicit, auditable choice.
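One way to picture such a policy is as a predicate over linked tokens that decides which ones get masked. The sketch below is an assumption about how this could be structured, not the framework's actual policy engine; the `Token` fields, category labels, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Token:
    text: str
    kind: str      # "table", "column", or "value"
    category: str  # assumed label, e.g. "name", "location", "other"


Policy = Callable[[Token], bool]  # returns True if the token must be masked


def full_policy(token: Token) -> bool:
    """Full policy: mask every schema element and literal value."""
    return True


def category_policy(categories: Iterable[str]) -> Policy:
    """Category-based policy: mask only tokens in the listed categories."""
    blocked = frozenset(categories)
    return lambda token: token.category in blocked


def select_sensitive(tokens: List[Token], policy: Policy) -> List[str]:
    """Return the texts of the tokens the policy selects for masking."""
    return [t.text for t in tokens if policy(t)]


# Hypothetical linked tokens for one question.
tokens = [
    Token("Patients", "table", "other"),
    Token("city", "column", "location"),
    Token("Jane Doe", "value", "name"),
    Token("2021", "value", "other"),
]
masked_full = select_sensitive(tokens, full_policy)
masked_by_category = select_sensitive(
    tokens, category_policy(["name", "occupation", "location"])
)
```

Under the full policy all four tokens are masked; under the category-based policy only `city` and `Jane Doe` are, which leaves more concrete context in the prompt at the cost of exposing the remaining tokens.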

6. Application Domains and Generalization

MaskSQL is suited for domains with high privacy demands, such as healthcare (e.g., masking patient identifiers and sensitive fields), finance (protecting account names and transactional data), and enterprise environments where the schema itself constitutes business intelligence. The abstraction approach may also apply to other LLM tasks over structured inputs, such as code synthesis, debugging, and data analysis, where the same masking idea can protect sensitive identifiers. The method provides a blueprint for hybrid local-remote deployment, maintaining data privacy through abstraction while using external compute for complex reasoning.

7. Limitations and Future Directions

MaskSQL currently relies on accurate local SLM-based schema and value linking; any error in token mapping can misalign the abstract and original schema and break reconstruction. It does not currently employ formal privacy mechanisms such as differential privacy, although strong empirical privacy metrics are reported. A residual performance gap remains between privacy-constrained and unconstrained approaches. Future work may explore tighter privacy guarantees (e.g., provable privacy), improved schema linking and error correction, and defenses against re-identification attacks.
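A simple local guard against one kind of misalignment is to check, before reconstruction, that the abstract SQL refers only to symbols that exist in the private mapping. The sketch below assumes ASCII symbols of the form `T1`, `C2`, `V3`; the function name and the symbol format are hypothetical, and this check does not catch a symbol that is valid but used in the wrong place.

```python
import re

# Assumed symbol format: a kind letter (T, C, or V) followed by digits.
SYMBOL = re.compile(r"\b[TCV]\d+\b")


def unknown_symbols(abstract_sql, known_symbols):
    """Return symbols used in the query that are missing from the mapping."""
    used = set(SYMBOL.findall(abstract_sql))
    return sorted(used - set(known_symbols))


known = {"T1", "C1", "C2", "V1"}
ok_query = "SELECT C2 FROM T1 WHERE C1 = V1"
bad_query = "SELECT C2 FROM T1 WHERE C9 = V1"  # C9 was never allocated
problems_ok = unknown_symbols(ok_query, known)
problems_bad = unknown_symbols(bad_query, known)
```

A non-empty result could be used as a signal to retry the remote call or to hand the query to a local correction step.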

MaskSQL represents a substantive advance in privacy-preserving SQL generation, allowing organizations to benefit from state-of-the-art LLMs while limiting exposure of sensitive information. By abstracting tokens rather than redacting or generalizing them, it sustains high task utility and competitive performance, establishing a flexible paradigm for integrating large models in sensitive computational environments (Abedini et al., 27 Sep 2025).
