
Vision Transformers with Homomorphic Encryption

Updated 30 November 2025
  • Vision Transformers with Homomorphic Encryption are models that integrate cryptographic protocols into transformer architectures to securely process sensitive visual data.
  • They employ the CKKS scheme to pack real vectors, enabling efficient linear and polynomial operations while approximating non-linear activation functions.
  • Empirical studies reveal significant communication efficiency, minimal accuracy loss, and enhanced resistance to inversion attacks via secure [CLS] token aggregation.

Vision Transformers (ViT) with Homomorphic Encryption (HE) integrates state-of-the-art transformer-based computer vision models with cryptographic protocols enabling privacy-preserving computation. The principal technological goal is to enable secure model training and inference on sensitive visual data—such as medical images—by performing the necessary linear and approximate nonlinear operations directly on encrypted feature representations or, in some instances, on encrypted raw images. The primary HE scheme leveraged across recent work is CKKS, which allows efficient approximate arithmetic over real vectors while providing semantic security under RLWE assumptions. Major research lines address federated learning (FL), fine-tuning, and secure inference under the constraints and trade-offs introduced by homomorphic evaluation of the full ViT computational graph.

1. Enabling ViT under Homomorphic Encryption

Standard ViT architectures incorporate linear projections for patch embeddings, multi-head self-attention (MHSA), MLP blocks with nonlinear activations (commonly GELU), positional encoding, LayerNorm, and residual pathways. HE, however, only supports addition and (polynomial) multiplication operations. This restriction necessitates either (a) extraction of ViT’s compact, discriminative representations (notably the 768-dimensional [CLS] token) for HE-protected aggregation and inference or (b) conversion of the entire ViT pipeline to an HE-compatible format via polynomial approximations of non-polynomial operations.

In both settings, the CKKS scheme encodes each $d$-dimensional real vector as a single packed ciphertext, supporting SIMD linear algebra. Plaintext manipulations on slot-packed data facilitate efficient evaluation of ViT linear layers.
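The slot-packing programming model can be illustrated with a plaintext simulation (no encryption is performed; the `PackedVector` class and its API are hypothetical, used only to show that one packed object carries a whole feature vector and that arithmetic acts slot-wise):

```python
# Plaintext simulation of CKKS slot packing: a whole d-dimensional vector
# occupies one "ciphertext", and arithmetic acts elementwise (SIMD) on slots.
# Illustrative only -- real CKKS operates on encrypted polynomial encodings.

class PackedVector:
    def __init__(self, slots):
        self.slots = list(slots)

    def __add__(self, other):   # slot-wise addition (one HE add, noise-stable)
        return PackedVector(a + b for a, b in zip(self.slots, other.slots))

    def __mul__(self, other):   # slot-wise multiplication (consumes one level)
        return PackedVector(a * b for a, b in zip(self.slots, other.slots))

    def scale(self, c):         # plaintext-ciphertext product (cheaper)
        return PackedVector(c * a for a in self.slots)

x = PackedVector([1.0, 2.0, 3.0])
y = PackedVector([0.5, 0.5, 0.5])
z = (x + y).scale(2.0)
print(z.slots)  # [3.0, 5.0, 7.0]
```

In real CKKS, a linear layer additionally needs slot rotations to sum across the packed dimension; the sketch only shows the SIMD batching idea.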

2. CKKS Homomorphic Encryption Framework and Parameters

CKKS is a leveled (optionally bootstrapped) approximate-HE scheme for real/complex numbers, parameterized by:

  • Polynomial modulus degree $N$ (e.g., $2^{13}$–$2^{16}$), providing both parameter trade-offs and resistance to lattice attacks.
  • Coefficient modulus chain $\{q_0 \gg q_1 \gg \dots \gg q_L\}$, determining the permissible multiplicative depth.
  • Scale $\Delta$ (typically $2^{40}$–$2^{42}$), controlling fixed-point quantization.
  • Secret key $sk \in R_q$; symmetric encryption of a scaled plaintext $\Delta m$ yields the ciphertext pair

$ct_0 = -a \cdot sk + e + \Delta m, \qquad ct_1 = a, \qquad a \sim U(R_q),\ e \sim \mathrm{DiscreteGaussian}.$

  • Ciphertexts as tuples in $R_q^2$ (or more, pre-relinearization).
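The relationship between the coefficient-modulus chain and multiplicative depth can be sketched as a back-of-the-envelope calculation (the prime bit sizes below are a common illustrative configuration, not parameters from the cited papers):

```python
# Each CKKS multiplication is followed by a rescaling that drops one prime
# from the coefficient-modulus chain, so the usable multiplicative depth is
# roughly the number of "middle" primes. Bit sizes are illustrative only.

def usable_depth(coeff_mod_bits):
    # The first and last primes are conventionally reserved for encryption
    # precision and decryption headroom; each middle prime funds one rescale.
    return max(0, len(coeff_mod_bits) - 2)

chain = [60, 40, 40, 40, 60]   # five primes, scale ~2^40
print(usable_depth(chain))     # 3 multiplications before bootstrapping
```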

Vector encoding and encryption, for ViT [CLS] extraction, involve packing the 768-D vector into a CKKS plaintext polynomial, with scale and noise parameters chosen to ensure post-quantum 128-bit security ($N = 8192$, Amin et al., 26 Nov 2025; $N = 2^{16}$, Panzade et al., 17 Jan 2024).

Noise management, depth-counting, and selective bootstrapping are critical for multi-layer evaluation. In MedBlindTuner (Panzade et al., 17 Jan 2024), bootstraps are triggered only once noise growth crosses a threshold, reducing FHE runtime by up to 2×.
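The lazy-bootstrapping policy just described can be sketched as a simple noise-budget tracker (all constants and the function name are illustrative, not MedBlindTuner's actual implementation):

```python
# Sketch of threshold-triggered ("lazy") bootstrapping: track a noise budget,
# spend part of it per multiplicative level, and bootstrap only when the
# remaining budget would fall below a safety threshold.

def evaluate_layers(num_layers, budget=120.0, cost_per_mult=20.0, threshold=30.0):
    bootstraps = 0
    remaining = budget
    for _ in range(num_layers):
        if remaining - cost_per_mult < threshold:
            remaining = budget          # bootstrap: refresh the noise budget
            bootstraps += 1
        remaining -= cost_per_mult      # one multiplicative level consumed
    return bootstraps

# Bootstrapping after every layer would cost 12 refreshes for a 12-layer
# circuit; the lazy policy needs only 2 under these illustrative constants.
print(evaluate_layers(12))  # 2
```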

3. Secure ViT Workflows: Architectures and Model Conversion

Table: ViT–HE Integration Approaches

| Method/Reference | HE Application | ViT Block Modifications |
|---|---|---|
| (Amin et al., 26 Nov 2025) | FL: [CLS] encryption/aggregation | Standard ViT; extract/encrypt [CLS] token |
| (Panzade et al., 17 Jan 2024) | FHE fine-tuning / end-to-end inference | DEiT with reduced dimensions; polynomial activations |
| (Zimerman et al., 2023) | Full polynomial ViT for HE | σ-attention; polynomial LayerNorm; BatchNorm |

Extracting the ViT [CLS] token for aggregation—rather than gradients—reduces transmission by 30× compared to gradient encryption (9,794.1 KB vs. 326.4 KB per sample) (Amin et al., 26 Nov 2025). For full HE-compatible execution, as in (Zimerman et al., 2023), each non-polynomial module must be replaced: softmax via “σ-attention” (a pointwise polynomial on $QK^\top$), GELU/ReLU by low-degree polynomial approximations over input-limited intervals, and LayerNorm by a polynomial inverse-sqrt (with a variance-range-compression regularizer). This ensures the entire circuit is polynomial prior to encryption. BatchNorm is often preferred to LayerNorm for normalization stability under HE, further recovering lost accuracy (to within 2% of baseline).
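A minimal example of the low-degree activation substitution described above: fitting a degree-7 polynomial surrogate for GELU over a bounded input interval by least squares. The interval and fitting procedure here are illustrative choices, not the coefficients published in the cited works:

```python
import numpy as np

# Fit a low-degree polynomial surrogate for GELU over a bounded interval,
# as required under CKKS (only additions and multiplications are available).
# Both the interval [-4, 4] and the least-squares fit are illustrative.

def gelu(x):
    # tanh-based GELU approximation, standard in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4.0, 4.0, 2001)          # range-limited inputs ("range forcing")
coeffs = np.polyfit(xs, gelu(xs), deg=7)   # degree-7, matching the text's budget
poly_gelu = np.poly1d(coeffs)

max_err = np.max(np.abs(poly_gelu(xs) - gelu(xs)))
print(f"max |poly - GELU| on [-4, 4]: {max_err:.4f}")
```

This is why range minimization matters: the narrower the guaranteed input interval, the lower the polynomial degree needed for a given error, and hence the shallower the HE circuit.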

Model compression—reducing embedding size $d$, heads $h$, and depth $L$—is a central optimization. MedBlindTuner (Panzade et al., 17 Jan 2024) sets $d=192$, $h=3$, $L=6$ to balance noise, ciphertext size, and polynomial depth.

4. Federated Learning, Secure Aggregation, and End-to-End Inference

In multi-institutional federated learning for histopathology classification (Amin et al., 26 Nov 2025), the client-side protocol is:

  1. Local ViT inference on each sample: compute $z_j = M_i.\mathrm{CLS}(\mathbf{x}_j)$.
  2. HE-encrypt $z_j$ via CKKS to $c_j^{(i)}$.
  3. Transmit $E_i = \{c_j^{(i)}\}_{j=1}^{n_i}$ to the server.

The server aggregates via slot-wise CKKS addition and plaintext scaling, producing averaged ciphertexts $\bar{c}_j$. CKKS homomorphically supports:

$\bar{c}_j = \frac{1}{N}\sum_{i=1}^{N} c_j^{(i)}$
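The server-side averaging can be simulated in plaintext to check the arithmetic (in deployment the identical slot-wise operations run on CKKS ciphertexts; this sketch verifies only the math):

```python
# Plaintext simulation of the server-side aggregation: slot-wise addition of
# per-client [CLS] vectors followed by a plaintext scaling by 1/N. Under CKKS
# the same additions and scaling are applied to encrypted vectors.

def aggregate_cls(client_vectors):
    n = len(client_vectors)
    d = len(client_vectors[0])
    summed = [0.0] * d
    for vec in client_vectors:          # corresponds to homomorphic additions
        for k in range(d):
            summed[k] += vec[k]
    return [s / n for s in summed]      # plaintext scaling by 1/N

clients = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]
print(aggregate_cls(clients))  # [2.0, 2.0, 2.0]
```

Because only additions and a plaintext scalar multiplication are needed, the aggregation consumes essentially no multiplicative depth, which is what makes the [CLS]-token protocol so cheap compared with encrypted-gradient pipelines.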

This approach ensures that only compact, informative representations are ever exposed, never raw gradients or data. Model inversion attacks on shared plaintext gradients achieve PSNR 52.26 dB, SSIM 0.999, and LPIPS $\approx 0$ (near-perfect reconstruction), but encrypted [CLS] tokens leak no reconstructable information: PSNR $< 20$ dB (random-noise level), secured under RLWE (Amin et al., 26 Nov 2025).

For encrypted inference, server-side operations on [CLS] ciphertexts involve plaintext linear classification (matrix–vector product), a degree-2 polynomial softmax approximation, and final prediction—all fully homomorphic (66 ms/image, 411 MB RAM per batch), yielding a ∼36× speedup over prior encrypted-gradient pipelines.
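A degree-2 polynomial softmax surrogate might look like the following; the specific quadratic $(1 + x/2)^2$ is an assumed illustrative choice, not the polynomial used in the cited work:

```python
# Degree-2 polynomial softmax surrogate: replace exp(x) with a quadratic that
# is non-negative on the working interval, then normalize. The polynomial
# (1 + x/2)^2 is an illustrative assumption, not taken from the papers.

def poly_softmax(logits):
    approx_exp = [(1.0 + x / 2.0) ** 2 for x in logits]  # degree-2, >= 0
    total = sum(approx_exp)                              # homomorphic additions
    # The final division is done in plaintext after decryption, or replaced
    # by a polynomial inverse approximation in fully encrypted pipelines.
    return [a / total for a in approx_exp]

probs = poly_softmax([1.0, 0.0, -1.0])
print(probs)        # preserves the ordering of the true softmax
print(sum(probs))   # sums to 1 by construction
```

For argmax-style classification, even an unnormalized surrogate often suffices, since only the ordering of the outputs determines the prediction.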

5. Accuracy, Communication, and Resource Trade-offs

Empirical findings across implementations highlight clear trade-offs:

  • (Amin et al., 26 Nov 2025): Unencrypted [CLS] accuracy 96.12%; encrypted [CLS] accuracy 90.02%; encrypted gradients drop to 85.35%. Communication is reduced 30× versus gradient-based methods.
  • (Panzade et al., 17 Jan 2024): On five MedMNIST2D datasets, encrypted fine-tuning incurs a ≤2% accuracy drop versus cleartext, with a per-batch ciphertext footprint of 10 MB (2048 features at 2 KB each) and fine-tuning times of 20–45 min (30× slower than unencrypted).
  • (Zimerman et al., 2023): Polynomial ViTs on CIFAR-100 attain 70.8% accuracy versus 73.4% for the original; Swin on Tiny-ImageNet, 58.9% (polynomial) vs. 59.4% (original).

Most accuracy loss is attributable to polynomial approximation error and/or model downsizing to fit HE noise and depth budgets. Range-minimization (limiting pre-polynomial input intervals) and substituting BatchNorm for LayerNorm critically preserve accuracy in HE contexts.

6. Technical Challenges and Optimization Strategies

  • Circuit Depth: Bootstrapping in CKKS is expensive; pipelines must minimize multiplicative depth. Each polynomial nonlinearity incurs as many multiplications as its degree, necessitating both low-degree approximations (e.g., degree-3 for softmax, degree-7 for GELU, degree-8 for inverse-sqrt) and minimizing layer count.
  • Noise Growth: Addition is noise-stable; multiplication increases noise roughly as $q_i \cdot \mathrm{noise}(c_1) \cdot \mathrm{noise}(c_2)/q_{i+1}$. Lazy bootstrapping, slot packing (SIMD), and depth budgeting are standard mitigations (Panzade et al., 17 Jan 2024).
  • Activation/Normalization: Non-polynomial activations and normalizations must be replaced. “σ-attention”—applying, e.g., GELU or ReLU polynomials directly to $QK^\top$—is an effective softmax surrogate (Zimerman et al., 2023). Range forcing and variance regularization make low-degree polynomial fits possible.
  • Model Partitioning/Freezing: For feasible depth budgeting, parts of the ViT may be “frozen” or run in plaintext; e.g., only classification heads are fine-tuned on ciphertext (Panzade et al., 17 Jan 2024). Other strategies include splitting ViT into FHE-amenable stages or leveraging hybrid HE/SMC protocols.
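The depth accounting in the first bullet can be made concrete with a coarse per-block tally, using the text's polynomial degrees (a rough upper-level estimate that ignores the linear layers' own multiplications; the per-block structure assumed here is one softmax, one MLP activation, and two normalizations):

```python
# Coarse multiplicative-level budget for a polynomial ViT, using the degrees
# from the text: degree-3 softmax, degree-7 GELU, degree-8 inverse-sqrt.
# Sequential (Horner-style) evaluation of a degree-d polynomial consumes d
# multiplications, matching the "as many multiplications as its degree" rule.

NONLINEARITY_DEPTH = {"softmax": 3, "gelu": 7, "inv_sqrt": 8}

def block_depth():
    # One transformer block: attention softmax + MLP activation + two norms.
    return (NONLINEARITY_DEPTH["softmax"]
            + NONLINEARITY_DEPTH["gelu"]
            + 2 * NONLINEARITY_DEPTH["inv_sqrt"])

def pipeline_depth(num_blocks):
    return num_blocks * block_depth()

print(block_depth())       # 26 levels per block
print(pipeline_depth(6))   # 156 levels for a 6-block MedBlindTuner-style model
```

Even this rough estimate makes plain why bootstrapping, layer freezing, and model compression are unavoidable: no practical modulus chain supports 150+ levels without refreshing.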

7. Outlook and Limitations

Research confirms that integrating ViT with HE is tractable for privacy-preserving medical imaging, with substantial communication and privacy benefits. Key limitations remain:

  • Scaling to full ViT-Base (∼86M parameters) or handling >12 multiplicative layers will generally require more efficient or frequent bootstrapping, further polynomial/memory optimizations, or hybrid secure protocols (Zimerman et al., 2023).
  • Trade-offs between compactness, encryption precision, runtime, and final accuracy are fundamental; e.g., increasing polynomial degrees or multiplicative depth yields diminishing returns past error tolerance.
  • Model input-interval regularization and manual coefficient tuning for polynomial approximations are still empirical and model-specific.

Current frameworks demonstrate that by extracting a 768-D [CLS] embedding, encrypting under CKKS, and performing subsequent aggregation and polynomial inference, it is possible to enable secure, low-overhead federated learning with minimal accuracy sacrifice (≤6.1% for [CLS] classification), strong resistance to inversion attacks, and orders-of-magnitude reduction in network bandwidth per sample (Amin et al., 26 Nov 2025).

Ongoing work explores even more compact polynomial transformer circuits, direct polynomial softmax approximations, rational-function encodings, automatic bootstrapping, and application to large or multimodal ViTs (Zimerman et al., 2023, Panzade et al., 17 Jan 2024, Amin et al., 26 Nov 2025).
