
Privacy-Preserving Transformer Inference

Updated 18 January 2026
  • Privacy-preserving transformer inference is a suite of techniques that integrate cryptographic primitives like secret-sharing and homomorphic encryption with secure protocol design to protect both user data and model parameters.
  • The approaches employ SMPC, HE, and hybrid protocols to handle both linear and non-linear transformer components efficiently, reducing communication overhead while maintaining accuracy.
  • Empirical benchmarks demonstrate near-cleartext performance on models such as BERT and GPT2, ensuring robust security under semi-honest threat models.

Privacy-preserving transformer inference encompasses a spectrum of cryptographic, algorithmic, and architectural approaches that enable the deployment of large transformer models (e.g., BERT, RoBERTa, GPT, ViT) in settings where user input data and/or model parameters must remain confidential. State-of-the-art frameworks combine secret-sharing, homomorphic encryption, and protocol-level system design to efficiently compute full transformer inference pipelines with rigorous privacy guarantees, targeting the standard semi-honest threat model. This article reviews core methodologies, recent algorithmic advances, empirical results, and future directions as crystallized in the latest research literature.

1. Threat Models and System Architectures

The canonical privacy-preserving transformer inference protocol involves a client holding private input data (text tokens, images) and a server holding proprietary model parameters. The goal is to compute model predictions such that (i) the model weights are never revealed to the client, and (ii) the input data are never revealed to the server, formalized by the simulation-based security definition in the semi-honest model: for any adversary controlling a single party, its protocol view is simulatable from its own input and output (Li et al., 15 May 2025).

Protocols broadly decompose into three architecture types:

  • Pure-SMPC protocols, in which client and server jointly evaluate the entire model over secret shares.
  • HE-only protocols, in which the server computes over ciphertexts, in some designs returning to the client for non-linear steps.
  • Hybrid HE–SMPC protocols, which evaluate linear layers homomorphically and dispatch non-linearities to SMPC sub-protocols.

In all cases, linear operations can be performed non-interactively over encrypted data or shares, while non-linearities (GeLU, Softmax, LayerNorm, ReLU, division, sqrt, etc.) pose primary efficiency and privacy bottlenecks.

2. Cryptographic Protocols for Private Inference

The dominant cryptographic primitives for PTI (privacy-preserving transformer inference) are:

  • Additive Secret Sharing (ASS): The input or parameter is split as x=x1+x2x = x_1 + x_2 over a large finite ring, and both client and server hold one share each. All linear operations are locally performed; multiplications rely on Beaver triple preprocessing (Li et al., 2022, Luo et al., 2024).
  • Homomorphic Encryption (HE): Schemes like BFV (integer, modulus switching) and CKKS (approximate reals, rescaling) allow additions and multiplications over ciphertexts (Chen et al., 2022, Zimerman et al., 2023, Xu et al., 27 Aug 2025).
  • Oblivious Transfer (OT)/Garbled Circuits (GC): Non-linear operations, particularly comparisons (ReLU, max), function evaluation (exp, reciprocal, sqrt) or piecewise prediction, are dispatched to lightweight MPC or OT-based protocols (Ding et al., 2023, Wang et al., 2024).
  • Conversion Protocols: Efficient and provably secure mechanisms for switching data between HE ciphertexts and SMPC shares, critical to minimize costly interaction and communication (Xu et al., 27 Aug 2025, Liu et al., 2023).
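
The ASS mechanics above can be made concrete with a toy two-party simulation. This is a minimal sketch: both "parties" live in one process, and the modulus `P` is an illustrative choice rather than any framework's actual ring. It shows that linear operations are local and that one multiplication consumes one preprocessed Beaver triple plus one pair of opened masked values:

```python
import random

P = 2**61 - 1  # illustrative modulus standing in for the protocol's finite ring

def share(x):
    """Split x into two additive shares: x = x0 + x1 (mod P)."""
    x0 = random.randrange(P)
    return x0, (x - x0) % P

def reconstruct(x0, x1):
    return (x0 + x1) % P

def beaver_mul(x_sh, y_sh, triple):
    """Multiply secret-shared x and y using a preprocessed triple (a, b, c = a*b)."""
    (a0, a1), (b0, b1), (c0, c1) = triple
    # Each party locally masks its shares; the masked values e, f are opened.
    e = reconstruct((x_sh[0] - a0) % P, (x_sh[1] - a1) % P)  # e = x - a
    f = reconstruct((y_sh[0] - b0) % P, (y_sh[1] - b1) % P)  # f = y - b
    # Local recombination: z = c + e*b + f*a + e*f reconstructs to x*y.
    z0 = (c0 + e * b0 + f * a0 + e * f) % P
    z1 = (c1 + e * b1 + f * a1) % P
    return z0, z1

# Offline phase: a random secret-shared triple with a*b = c.
a, b = random.randrange(P), random.randrange(P)
triple = (share(a), share(b), share(a * b % P))

x, y = 1234, 5678
z = reconstruct(*beaver_mul(share(x), share(y), triple))
assert z == x * y  # e and f reveal nothing about x, y: a and b are uniform masks
```

Opening `e` and `f` leaks nothing because `a` and `b` are uniformly random one-time masks; this is why triples must be preprocessed and never reused.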

Hybrid protocols, such as BLB, carefully fuse adjacent linear operators, evaluate every block that permits it under HE, and invoke SMPC/OT only when strictly necessary, eliminating spurious conversions and minimizing overhead (Xu et al., 27 Aug 2025).

3. Efficient Secure Evaluation of Non-Linear Functions

The principal communication and computational bottleneck in privacy-preserving transformer inference arises from non-linearities, especially Softmax, GeLU, and LayerNorm.

Communication-Efficient Approximations:

  • Comet replaces GeLU, Softmax, and LayerNorm routines with a unified inverse-square-root approach, using a “double approximation” for the initial guess (local computation, no LUT/OT) and 3–4 rounds of Newton’s method in MPC, supported by a technique termed “share flooding” to align exponent bits for guaranteed convergence. This reduces non-linear layer communication up to 3.9× and speeds up inference up to 3.5× compared to LUT/polynomial-based protocols, with negligible accuracy degradation (<0.5% on GLUE) (Xu et al., 2024).
  • Oblivious Piecewise Polynomial Evaluation (OPPE): As in East, all complex activations are approximated via m-segment, degree-d piecewise polynomials, yielding accurate, oblivious protocols for GELU/tanh; accompanying protocols for Softmax compute exponentiations and reciprocal via Newton iteration, while LayerNorm is handled by algebraic rewriting and fast secure inverse square root (Ding et al., 2023).
  • 2Quad and SMU/Softmax* Substitutions: SecFormer, MPCFormer, and Comet all implement quadratic or ReLU-based approximations to Softmax and GeLU, which drastically cut communication for non-linearities (SecFormer achieves 3.6× speedup vs. PUMA on BERT-base, using a 2Quad Softmax and Goldschmidt reciprocal) (Luo et al., 2024, Li et al., 2022, Xu et al., 2024).
  • Encrypted Polynomial Reduction: CipherPrune assigns low-degree polynomial approximations to tokens deemed unimportant via a layer-wise encrypted importance scoring system, optimizing the assignation with gradient-based search to maximize both speed and accuracy (Zhang et al., 24 Feb 2025).
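
The Newton-iteration idea underlying the Comet-style approach can be seen in plaintext. This is a sketch only: the real protocol runs these steps on secret shares with exponent alignment ("share flooding"), and the exponent-based initial guess here merely stands in for the paper's local "double approximation":

```python
import math

def inv_sqrt_newton(u, iters=4):
    """Plaintext sketch of Newton's method for 1/sqrt(u).
    The initial guess uses only the exponent bits of u; each refinement
    step needs only additions and multiplications, which are cheap to
    evaluate on secret shares."""
    _, e = math.frexp(u)               # u = m * 2^e with m in [0.5, 1)
    y = 2.0 ** (-e / 2)                # rough 1/sqrt(u) from the exponent alone
    for _ in range(iters):
        y = y * (1.5 - 0.5 * u * y * y)  # Newton step for f(y) = 1/y^2 - u
    return y
```

Because Newton's method converges quadratically from this exponent-level guess, 3–4 iterations already recover 1/√u to high precision, which is why the MPC round count stays small.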

Communication reduction in non-linear layers is the critical breakthrough enabling practical PTI on large transformer models.

4. Scalable and Specialized Architectures

Specialized architectural and protocol designs unlock further efficiency and system-level scalability.

  • Mixture-of-Experts (MoE)/Sparsity: SecMoE enables private, sparse MoE inference by securely routing tokens to selected experts via an OT-based Select-Then-Compute paradigm, performing oblivious parameter selection and single-expert computation, reducing communication by up to 7.1× (with a 15.2× total runtime increase for 63× model capacity scale-up) (Shen et al., 11 Jan 2026).
  • Token Pruning: CipherPrune adaptively prunes less salient tokens at each layer in encrypted space, cutting quadratic attention overhead and overall data volume, achieving 6–10× speedups with ≤0.2% accuracy loss on GLUE (Zhang et al., 24 Feb 2025).
  • Parameter-Efficient Fine-Tuning (PEFT): CryptPEFT confines private computation to adapters in a one-way flow, searching the adapter design space to trade off minimal encrypted workload against utility, realizing 20×–291× speedup over global MPC and maintaining accuracy within 1% (Xia et al., 17 Aug 2025).
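
The token-pruning pattern can be illustrated with a small plaintext sketch. In CipherPrune the importance scores and threshold comparisons are evaluated under encryption with a learned scoring system; here everything is in the clear, and scoring tokens by the total attention they receive is an illustrative stand-in:

```python
def prune_tokens(hidden, attn_probs, keep_ratio=0.5):
    """Keep the top fraction of tokens by importance, preserving order.
    hidden: list of token vectors; attn_probs: row-stochastic seq x seq matrix.
    Importance of token j = total attention it receives (column sum)."""
    seq = len(hidden)
    scores = [sum(attn_probs[i][j] for i in range(seq)) for j in range(seq)]
    k = max(1, int(seq * keep_ratio))
    keep = sorted(sorted(range(seq), key=lambda j: scores[j])[-k:])
    return [hidden[j] for j in keep], keep

# 4 tokens; attention concentrates on tokens 0 and 2
attn = [[0.4, 0.1, 0.4, 0.1]] * 4
hidden = [[1.0], [2.0], [3.0], [4.0]]
pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.5)
# kept == [0, 2]: the two most-attended tokens survive, so the next layer's
# quadratic attention runs on 2 tokens instead of 4.
```

Halving the sequence length at a layer shrinks that layer's quadratic attention cost by roughly 4×, which is where the reported 6–10× end-to-end speedups come from once pruning compounds across layers.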

Practical systems routinely combine token pruning, sparse/expert layers, and architectural decompositions to amortize cryptographic cost.

5. Empirical Results and Protocol-Level Tradeoffs

Extensive benchmarks on BERT, RoBERTa, GPT2, and ViT models (texts, images) demonstrate the effectiveness of these protocols under realistic LAN/WAN network settings.

Performance and Bandwidth:

  • Pure-SMPC protocols achieve low latency (sub-minute) under high-bandwidth, but scale poorly with sequence length and non-linearity count (Li et al., 2022, Luo et al., 2024).
  • Hybrid and communication-efficient protocols (Comet, BLB, CipherPrune) bring communication per GLUE sample down to ∼5–12 GB and runtime to seconds–minutes range, matching or beating cleartext accuracy (within ≤1%) (Xu et al., 2024, Xu et al., 27 Aug 2025, Zhang et al., 24 Feb 2025).
  • HE-only protocols (THE-X, conversion-to-polynomial) achieve minimal communication (a few hundred MB per query) at the cost of substantially higher local compute and often interactive “client-in-the-loop” for ReLU (Chen et al., 2022, Zimerman et al., 2023).
  • End-to-end, CipherPrune outperforms BOLT and Iron by >3× speedup at scale (Zhang et al., 24 Feb 2025), while BLB sets the communication/latency frontier (21×/13× improvement vs. prior hybrid methods on large BERT and GPT2) (Xu et al., 27 Aug 2025).

Accuracy Impact and Trade-Offs:

  • Quadratic vs. higher-degree polynomial approximations shift where the communication/latency cost concentrates for large models.
  • Fixed-point arithmetic incurs constant overhead in share flooding, but is tractable with bounded activation ranges (Xu et al., 2024, Ding et al., 2023).
  • Slight utility drop (<1%) in unified nonlinear replacements (e.g., SMU, 2Quad) is acceptable for practical deployment; special cases (e.g., RTE) may see uplift (Xu et al., 2024).
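
The fixed-point arithmetic in the trade-offs above can be shown in isolation. This sketch (the precision parameter is an illustrative choice, not any framework's setting) shows why every secret-shared multiplication must be followed by a truncation, which costs interaction in most MPC protocols:

```python
FRAC_BITS = 16            # fractional precision; bounded activation ranges keep this safe
SCALE = 1 << FRAC_BITS

def encode(x):
    """Map a real value to a fixed-point integer, as done before secret sharing."""
    return round(x * SCALE)

def decode(v):
    return v / SCALE

def fxp_mul(a, b):
    """Fixed-point multiply: the raw product carries 2*FRAC_BITS fractional
    bits, so one truncation (arithmetic right shift) must follow to restore
    the encoding. Python's >> floors, matching an arithmetic shift."""
    return (a * b) >> FRAC_BITS

x, y = encode(1.5), encode(-0.25)
assert decode(fxp_mul(x, y)) == -0.375  # 1.5 * -0.25, exactly representable
```

With activations bounded (as the cited protocols assume), `FRAC_BITS` can be chosen so that products never overflow the ring, keeping the per-multiplication truncation overhead constant.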

6. Privacy, Security, and Future Directions

All protocols adhere to the “semi-honest” adversarial model, in which parties execute the protocol honestly but may attempt to infer information from their transcripts; privacy of both input and model weights is enforced via established simulation-based proofs (UC-style indistinguishability). Extensions to malicious adversary settings require additional mechanisms (MACs, consistency checks, ZK proofs) and remain an active research area (Xu et al., 2024, Li et al., 15 May 2025, Ding et al., 2023).

Classical privacy criteria (e.g., differential privacy) are being adapted: NVDP introduces a principled nonparametric variational information bottleneck, yielding Bayesian DP and Rényi DP guarantees for sanitized representations (Zein et al., 5 Jan 2026).

Ongoing and Future Directions:

  • Robustness to distributional shift: Share flooding and polynomial-range guarantees for out-of-distribution activations.
  • Malicious security and multi-party models: Extending two-party protocols to honest/dishonest-majority SMPC, stronger semi-malicious adversaries, and TEE/SGX settings (Xu et al., 2024, Li et al., 15 May 2025).
  • Automated protocol-aware architecture search: As in CryptPEFT and CipherPrune, searching over model and protocol hyperparameters (adapter depth/width, pruning thresholds, polynomial degrees) for best accuracy/efficiency balance.
  • Generative and open-ended tasks: Extending high-accuracy, low-latency PTI to autoregressive decoding and open-domain generation, currently unsolved at reasonable overhead (Li et al., 15 May 2025).
  • GPU acceleration and hardware optimizations: Most frameworks remain CPU-bound; specialized GPU kernels for (I) HE arithmetic, (II) secret-sharing and OT, and (III) garbled circuit evaluation are critical for deployment (Li et al., 15 May 2025, Liu et al., 2023).

7. Summary Table: Representative Protocols, Approximation Strategies, and Impact

| Framework / Protocol | Main Cryptographic Primitives | Non-linear Handling | Communication Reduction | Typical Accuracy Drop | Reference |
|---|---|---|---|---|---|
| Comet | Hybrid ASS-HE, MPC | Unified 1/√u via SMU/Softmax* | 3.9×–4× | ≤0.5% | (Xu et al., 2024) |
| SecMoE | ASS, HE, OT-based routing | Select-then-compute | 1.8×–7.1× | ≤1.2×10⁻² | (Shen et al., 11 Jan 2026) |
| CipherPrune | HE + ASS/OT, pruning, adaptation | Polynomial reduction | 6–10× (128–512 tokens) | <0.2% | (Zhang et al., 24 Feb 2025) |
| BLB | Fused CKKS–ASS, efficient conversion | Fused linear/poly approx. | 21× comm., 13× lat. | ≤0.3% | (Xu et al., 27 Aug 2025) |
| CryptPEFT | Adapter-focused OWC + MPC | Adapter only, linear attn. | 20–291× (vs. PEFT SOTA) | ~1% | (Xia et al., 17 Aug 2025) |
| East | ASS, optimized OPPE, secure softmax | Piecewise poly + Newton | 1.8× non-linear comm. | 0% (exact BERT) | (Ding et al., 2023) |

Empirical evidence indicates that modern PTI achieves near-plaintext accuracy at tractable communication and runtime cost for moderate sequence lengths, with several protocols supporting models with over a hundred million parameters. Methodological progress centers on unifying and streamlining cryptographic operations in non-linear layers, dataset and token-aware protocol adaptation, and system co-design for realistic MLaaS deployment.

For a comprehensive analysis, see survey (Li et al., 15 May 2025) and the referenced system papers.
