Privacy-Preserving Cloud Inference
- Privacy-preserving cloud inference is the use of cryptographic techniques and secure architectures to perform ML inference without exposing sensitive data.
- It leverages homomorphic encryption, MPC, and TEEs to ensure data and model confidentiality while optimizing computation, latency, and scalability.
- Practical implementations balance security, performance, and scalability through advanced packing strategies, edge-cloud collaboration, and formal security proofs.
Privacy-preserving cloud inference refers to protocols, cryptographic systems, and hybrid architectures that enable machine learning inference to be performed on cloud infrastructure without exposing sensitive input data (and, often, model parameters) to potentially untrusted cloud operators. The primary aim is to guarantee the confidentiality of user inputs and, in some architectures, to protect the privacy or intellectual property of the deployed models, even under strong adversarial models. Research in this area addresses the trade-offs among security, computation cost, system scalability, and inference accuracy across a wide range of threat models, using techniques such as homomorphic encryption (HE), secure multi-party computation (MPC), trusted execution environments (TEEs), and hybrid edge–cloud collaboration.
1. Architectures and Threat Models
A central concern in privacy-preserving cloud inference is the adversarial relationship among the involved parties: data owner (client), cloud server(s), and model provider (optional third party). Modern designs formalize a variety of threat models:
- Three-Party Setting with Mutual Distrust: The model provider (MP) wishes to protect proprietary model weights, and the data provider (DP) wishes to keep inputs and outputs confidential. The cloud is split into a trusted TEE enclave and an untrusted resource-rich execution environment (REE). Only the TEE is assumed to be fully trusted; the REE and the other parties are regarded as potentially adversarial (Liu et al., 2022).
- Multi-Party Cloud Computation: Some frameworks utilize multiple non-colluding cloud servers (typically 2–4), and assume semi-honest adversaries controlling at most one party. These settings protect sensitive graph data and GNN parameters (Wang et al., 2022, Chen et al., 3 Nov 2025).
- Edge–Cloud Collaborative Split: Joint edge–cloud inference splits the model so that the lower layers run on resource-constrained but private edge devices, while the cloud handles the heavier computation on intermediate representations, which are processed for privacy before being sent to the cloud (Osia et al., 2017, Wang et al., 2022, Song et al., 30 Jan 2026).
- Specialized Models and Data Types: Architectures are adapted to tree models (decision trees, Hoeffding trees), transformers, GNNs, and LLMs (Wang et al., 2022, Yuan et al., 2024, Chen et al., 2022, Luo et al., 2024, Huang et al., 27 Feb 2026, Chen et al., 3 Nov 2025, Yu et al., 19 Mar 2026).
Security goals are typically formalized as input confidentiality, model/parameter confidentiality, output privacy, integrity/authenticity of results, and provable resistance to privacy attacks (e.g., reconstruction, secondary inference, membership inference).
2. Cryptographic and Algorithmic Fundamentals
Privacy-preserving inference schemes rely on rigorous cryptographic primitives, often combining them for practical, scalable performance:
- Leveled and Fully Homomorphic Encryption (LHE/FHE): Frameworks employ schemes such as CKKS (for approximate arithmetic over real numbers) and BFV (for exact integer arithmetic) to evaluate arithmetic circuits over ciphertexts, which covers most linear ML computations (Liu et al., 2022, Bollikonda, 28 Oct 2025, Chen et al., 2022).
- Packing Strategies: To optimize throughput and reduce per-inference latency, ciphertext packing (SIMD-style schemes, batching, cross-channel/filter packing) is used, exploiting ring/ciphertext structures to process many instances in parallel (Liu et al., 2022, Bollikonda, 28 Oct 2025).
- MPC, Secret Sharing, and Garbled Circuits: Many frameworks implement additive secret sharing protocols for linear computations. Multiplications of shared values rely on precomputed Beaver triples, and more complex operations (e.g., non-linear activations) rely on garbled circuits (GC) and custom oblivious transfer (OT) protocols designed to reduce communication and round complexity (Wang et al., 2022, Dehkordi et al., 2024, Chen et al., 3 Nov 2025).
- TEE Integration: Trusted enclaves (e.g., Intel SGX) serve as secure keystores, perform essential decryption and re-encryption, and run small sensitive code footprints, while intensive computation is offloaded to untrusted but performant cloud server segments. TEEs do not eliminate all side channels, but a small enclave code footprint makes mitigation easier (Liu et al., 2022, Wang et al., 2022).
- Verifiability: Some protocols provide computationally lightweight verifiability, such as algebraic tags and correctness proofs that catch cloud misbehavior with high probability (e.g., with failure probability ≤1/|Z| per operation) (Lu et al., 18 Aug 2025).
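The secret-sharing bullet above can be made concrete with a minimal two-server sketch. It is illustrative only and is not taken from any of the cited frameworks: the modulus `P`, the helper names, and the use of a trusted dealer to produce the Beaver triple in the offline phase are all assumptions. Addition of shares is local, while multiplication consumes one precomputed triple and opens two masked values.

```python
import secrets

P = 2**61 - 1  # prime modulus for the share field (illustrative assumption)

def share(x):
    """Split x into two additive shares modulo P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % P

def add_shares(x_sh, y_sh):
    """Addition is local: each server adds the shares it holds."""
    return (x_sh[0] + y_sh[0]) % P, (x_sh[1] + y_sh[1]) % P

def make_triple():
    """Trusted-dealer stand-in for the offline phase: shares of a, b, c = a*b."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)

def beaver_multiply(x_sh, y_sh, triple):
    """Multiply two shared values using one Beaver triple."""
    (a0, a1), (b0, b1), (c0, c1) = triple
    # Servers exchange masked shares so that d = x - a and e = y - b become public.
    d = reconstruct((x_sh[0] - a0) % P, (x_sh[1] - a1) % P)
    e = reconstruct((y_sh[0] - b0) % P, (y_sh[1] - b1) % P)
    # x*y = c + d*b + e*a + d*e; the public d*e term is added by server 0 only.
    z0 = (c0 + d * b0 + e * a0 + d * e) % P
    z1 = (c1 + d * b1 + e * a1) % P
    return z0, z1

if __name__ == "__main__":
    product = beaver_multiply(share(7), share(6), make_triple())
    print(reconstruct(*product))
```

Because `a` and `b` are uniformly random and used once, the opened values `d` and `e` reveal nothing about `x` and `y`. In a deployed protocol the triple would come from an offline phase rather than a single dealer function.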
3. Inference Protocols Across Model Types
Privacy-preserving inference protocols are instantiated for a variety of machine learning models:
- Convolutional and Dense Neural Networks: HE-based approaches encode model weights and user inputs, evaluate all linear layers under encryption, and handle non-linearities either by polynomial approximation or through interactive protocols. Recent work combines HE with TEEs to minimize enclave dependence and achieve high throughput for CNNs, supporting efficient batching and generic packing (Liu et al., 2022, Bollikonda, 28 Oct 2025, Xie et al., 2021).
- Decision Trees and Hoeffding Trees: Advanced schemes such as OnePath securely traverse only the prediction path using functional encryption (FE), secret sharing, and lightweight symmetric encryption to yield microsecond-level latencies and simulation-based security for both models and data (Yuan et al., 2024). EnclaveTree focuses on side-channel-resilient, matrix-based oblivious inference within TEEs (Wang et al., 2022).
- Graph Neural Networks: SecGNN implements privacy-preserving GNN inference via 2-out-of-2 additive secret sharing and custom MPC protocols for secure array access, ReLU, normalization, and softmax. Panther advances this with four-party replicated secret sharing, randomized neighbor padding, and asynchronous protocols to reduce both latency and cost, achieving over 80% time and 50% bandwidth reduction relative to prior art (Wang et al., 2022, Chen et al., 3 Nov 2025).
- Transformers and LLMs: Protocols such as THE-X and CENTAUR adapt polynomial approximation, SMPC, and permutation-based mechanisms to transformers, balancing privacy, performance, and efficiency across complex non-linearities and normalization layers (Chen et al., 2022, Luo et al., 2024). Talaria partitions LLM pipelines using confidential virtual machines (CVMs) and "Reversible Masked Outsourcing" to maintain both input/output and weight confidentiality (Huang et al., 27 Feb 2026). PlanTwin abstracts local context into digital twins for planning over private environments (Yu et al., 19 Mar 2026).
- Cloud-Edge Collaborative and Federated Settings: CIS and similar frameworks add tailored differential privacy noise to intermediate layer feature maps at the device–cloud boundary, defend against advanced reconstruction attacks, and dynamically partition the model for latency-optimized collaborative inference (Wang et al., 2022, Punniyamoorthy et al., 11 Dec 2025).
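The noise-injection step at the device–cloud boundary described in the last bullet can be sketched as follows. This is a generic Laplace mechanism on a norm-clipped feature vector, not the allocation scheme of CIS or any other cited framework. The function names, the L1 clipping bound, and the single global `epsilon` are assumptions made for illustration. Clipping to an L1 norm of at most `bound` limits the L1 distance between any two feature vectors to `2 * bound`, which sets the noise scale.

```python
import random

def clip_l1(features, bound):
    """Scale the vector down so that its L1 norm is at most `bound`."""
    norm = sum(abs(v) for v in features)
    if norm <= bound:
        return list(features)
    scale = bound / norm
    return [v * scale for v in features]

def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(features, bound, epsilon, rng=None):
    """Clip on the device, then add Laplace noise before upload to the cloud."""
    rng = rng or random.Random()
    clipped = clip_l1(features, bound)
    scale = 2.0 * bound / epsilon  # L1 sensitivity of the clipped vector is 2 * bound
    return [v + laplace(scale, rng) for v in clipped]

if __name__ == "__main__":
    edge_features = [3.0, -4.0, 1.0]  # hypothetical output of the on-device layers
    print(privatize(edge_features, bound=2.0, epsilon=1.0))
```

A smaller `epsilon` gives stronger privacy but noisier features, which is the privacy–utility trade-off that per-layer or per-channel allocation schemes try to tune.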
4. Performance, Scalability, and Deployment
System-level optimization is critical given the high computational cost of cryptographic protocols:
- Packing and Parallelization: Techniques such as ciphertext SIMD, block parallelism, operator fusion, and modulus switching are used to raise throughput and lower latency, as observed in containerized, Kubernetes-orchestrated HE pipelines (Bollikonda, 28 Oct 2025, Liu et al., 2022). Elastic scaling via autoscaling and microservices provides near-linear cluster throughput.
- Batching Strategies: Amortized per-inference latency decreases sharply with batch size due to packing (e.g., for CNN 3–2, latency falls from 2.86 s at n=16 to 1.15 s at n=512) (Liu et al., 2022).
- Resource Scheduling: Hierarchical and distributed inference protocols offload heavy arithmetic to cloud clusters while reducing inter-party communication (e.g., via offline/online separation and additive secret shares) (Dehkordi et al., 2024).
- Accuracy and Utility: HE- and MPC-based solutions preserve test accuracy to within fractions of a percent compared to plaintext baselines (e.g., Panther ≤0.6% drop for GNNs, OnePath negligible for DTs) (Chen et al., 3 Nov 2025, Yuan et al., 2024). Approximation of non-polynomial functions (GELU, softmax, LayerNorm) in transformers yields <1.5% mean drop in GLUE/NER tasks (Chen et al., 2022, Luo et al., 2024).
- Efficiency Benchmarks: State-of-the-art GNN inference under Panther reduces cloud cost by an estimated 59% (Google Cloud on-demand) and cuts runtime by up to 82.80% versus previous MPC paradigms (Chen et al., 3 Nov 2025). For privacy-preserving CNN convolutional layers with verifiability, speedups of 26–87× are recorded relative to client-only execution (Lu et al., 18 Aug 2025).
- Limitations: Deep networks involving frequent bootstrapping, high communication, or complex non-linear activation remain challenging to execute in real-time, especially on general-purpose CPUs (Bollikonda, 28 Oct 2025).
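To illustrate why packing reduces the amortized cost, the sketch below emulates SIMD slots in plaintext with ordinary Python lists. It is not an HE implementation and does not use the API of any HE library: the helpers `rotate`, `slot_mul`, and `slot_add` are hypothetical stand-ins for the slot rotation, slot-wise multiplication, and slot-wise addition that a packed scheme such as CKKS or BFV exposes on ciphertexts. An inner product over n slots then needs one multiplication and log2(n) rotate-and-add steps instead of n separate ciphertext operations.

```python
def rotate(slots, k):
    """Cyclic left rotation by k slots, mirroring an HE slot rotation."""
    k %= len(slots)
    return slots[k:] + slots[:k]

def slot_mul(a, b):
    """Slot-wise product of two packed vectors."""
    return [x * y for x, y in zip(a, b)]

def slot_add(a, b):
    """Slot-wise sum of two packed vectors."""
    return [x + y for x, y in zip(a, b)]

def packed_dot(x, w):
    """Inner product via one slot-wise multiply and log2(n) rotate-and-add steps."""
    n = len(x)
    assert n >= 1 and n == len(w) and n & (n - 1) == 0, "slot count must be a power of two"
    acc = slot_mul(x, w)
    step = n // 2
    while step >= 1:
        acc = slot_add(acc, rotate(acc, step))
        step //= 2
    return acc[0]  # after the loop every slot holds the full sum

if __name__ == "__main__":
    print(packed_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

The same slot layout can instead hold one value from each of many independent inputs, in which case a single ciphertext operation serves the whole batch; that is the mechanism behind the drop in amortized latency as the batch size grows.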
5. Security Analysis and Formal Guarantees
State-of-the-art frameworks establish rigorous theoretical guarantees:
- Data and Model Confidentiality: HE schemes provide IND-CPA security (e.g., RLWE-based CKKS, BFV), while MPC and secret-sharing architectures offer unconditional secrecy as long as the number of colluding servers stays below a threshold (Liu et al., 2022, Wang et al., 2022).
- Verifiability and Soundness: Algebraic tags and correctness proofs (e.g., client-side tags with random linear combinations) ensure that incorrect server-side inference is detected with probability at least 1 − 1/|Z| (where |Z| is the masking field size) (Lu et al., 18 Aug 2025).
- Resistance to Inference and Reconstruction Attacks: Protocols are evaluated against white-box and black-box attack models, including attribute inference, membership inference, and deep generative reconstructions. Layerwise DP mechanisms (e.g., per-channel Laplace or Gaussian) substantially degrade attack success rates. In federated and distributed settings, privacy budgets are enforced per-request and attested via zk-SNARKs (Punniyamoorthy et al., 11 Dec 2025, Wang et al., 2022).
- Side-Channel and Compositional Risks: Formal security models often assume no leakage outside the cryptographic protocol; that is, they either exclude TEE side channels from the model or assume that the enclaves have been hardened against them. Matrix-based, data-oblivious algorithms further mask access patterns in TEE-based solutions (Wang et al., 2022).
- Simulation-based Proofs: Many schemes provide simulation-based indistinguishability proofs in the semi-honest model, arguing that no coalition of adversaries can distinguish protocol views from simulated ones, except for minimal (often path-index or output-only) leakage (Yuan et al., 2024).
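The 1 − 1/|Z| detection bound quoted above for tag-based verification has the same form as the classical Freivalds-style check, sketched below for a single linear layer y = A x over a prime field. This is an illustration of the random-linear-combination idea and not the protocol of Lu et al.: the field size `P`, the function names, and the choice to keep the vector `r` secret on the client are assumptions. If the server returns any y different from A x and cannot predict `r`, the check passes with probability at most 1/P.

```python
import secrets

P = 2**61 - 1  # prime field size, standing in for |Z| (illustrative assumption)

def matvec(A, x):
    """Server-side work: y = A x over the field."""
    return [sum(a * b for a, b in zip(row, x)) % P for row in A]

def make_tag(A):
    """Client-side precomputation: secret random r and the tag t = r^T A."""
    rows, cols = len(A), len(A[0])
    r = [secrets.randbelow(P) for _ in range(rows)]
    t = [sum(r[i] * A[i][j] for i in range(rows)) % P for j in range(cols)]
    return r, t

def verify(r, t, x, y):
    """Accept y as A x if and only if <t, x> == <r, y> (mod P)."""
    lhs = sum(a * b for a, b in zip(t, x)) % P
    rhs = sum(a * b for a, b in zip(r, y)) % P
    return lhs == rhs

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    x = [5, 6]
    r, t = make_tag(A)
    y = matvec(A, x)            # honest server result
    print(verify(r, t, x, y))   # honest result is always accepted
```

Verification costs the client two inner products rather than a full matrix-vector product. The bound holds per check only while `r` stays hidden from the server, so a real protocol has to account for what repeated accept/reject outcomes reveal.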
6. Practical Trade-offs and Open Directions
Major challenges and considerations in privacy-preserving cloud inference include:
- Usability and Composability: Designs must accommodate arbitrary model architectures, varying trust structures, and diverse privacy policies across applications (e.g., medical analytics, surveillance, NLP, planning) (Yu et al., 19 Mar 2026).
- Latency–Bandwidth–Cost Trade-offs: Increased party count (MPC), heavy cryptographic primitives, or complex noise calibration raise communication and computational costs. Current best practice balances offloading, packing, and trusted enclave usage to minimize end-to-end latency and operational expenditure (Chen et al., 3 Nov 2025, Bollikonda, 28 Oct 2025).
- Scalability and Federated Settings: Hierarchical partitioning and asynchronous execution reduce central bottlenecks for distributed and federated learning and inference (Dehkordi et al., 2024, Punniyamoorthy et al., 11 Dec 2025).
- Privacy-Utility Frontier: Mechanisms such as feature-rank-guided DP allocation or autoencoder/CAM plug-ins define and optimize for specific trade-offs between primary inference accuracy and resistance to inference attacks (Wang et al., 2022, Higgins et al., 28 Feb 2025).
- Future Research: Topics of active inquiry include robust malicious security (beyond the semi-honest model), hybrid protocols combining HE, MPC, and TEE for deep models, specialized hardware acceleration for lattice arithmetic, formal utility bounds for DP, and dynamic policy-driven governance (Luo et al., 2024, Punniyamoorthy et al., 11 Dec 2025, Bollikonda, 28 Oct 2025). Cross-domain methodologies such as information bottleneck–based transformations and digital twin abstractions further expand the utility-privacy trade space (Song et al., 30 Jan 2026, Yu et al., 19 Mar 2026).