
Weight-Private ML Inference Protocols

Updated 14 December 2025
  • Weight-private ML inference is a set of protocols ensuring that a server's confidential model weights remain hidden while processing user inputs.
  • It leverages cryptographic techniques, PIR-inspired methods, and quantization to balance efficiency with robust privacy guarantees.
  • Practical implementations demonstrate significant speedups and minimal accuracy loss, driving advances in privacy-preserving MLaaS.

Weight-private machine learning inference refers to a class of protocols in which inference is performed on a model privately held by a server, such that the model's weights remain hidden from the user and the user's data remain hidden from the server, except for the final model output. This concept is central in privacy-preserving machine learning-as-a-service (MLaaS), where strong confidentiality guarantees for both the model intellectual property and sensitive user inputs are required. Technical approaches span information-theoretic, cryptographic (homomorphic encryption, secure multi-party computation, garbled circuits), and private information retrieval–inspired protocols, often leveraging quantization or architectural adaptations to improve efficiency and scalability.

1. Formal Models and Privacy Guarantees

Weight-private inference is characterized by a two-party protocol between a server holding a model parameter vector (often quantized to a finite alphabet $W$) and a user supplying an input $x$ to the model. The transcript of exchanged messages must guarantee:

  • Weight (Model) Privacy: The user should learn nothing about the model parameters $w$ except what is inevitably disclosed by the model output $f(\langle w, x \rangle)$. This is quantified as $I(W; T) \leq \epsilon_S$, where $T$ is the transcript.
  • Input (User) Privacy: The server should learn nothing about the user input $x$ except as required for computing the output. For linear inner-product–based inference, privacy is parameterized by the dimension $\ell$ of the subspace of $\mathbb{R}^n$ revealed via the returned linear measurements; equivalently, $I(X; T) \leq \epsilon_U$ (Deng et al., 2023).

The canonical protocol form consists of (i) a query sent from the server to the user encoding obfuscated model information, (ii) a user reply derived from $x$ and the query, and (iii) recovery of the final output by the server.
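
This three-message structure, together with the leakage constraints above, can be summarized in a single display; the symbols $Q$ (query), $A$ (reply), $R$ (server randomness), and the maps $q$, $a$, $\mathrm{rec}$ are generic notation introduced here for concreteness rather than the notation of any particular paper:

$$
\begin{aligned}
&\text{(i)}\;\; Q = q(w, R)\ \text{(server}\to\text{user)}, \qquad
\text{(ii)}\;\; A = a(Q, x)\ \text{(user}\to\text{server)}, \qquad
\text{(iii)}\;\; \hat{y} = \mathrm{rec}(A, w, R) = f(\langle w, x \rangle),\\[4pt]
&T = (Q, A), \qquad \text{weight privacy: } I(W; T) \le \epsilon_S, \qquad \text{input privacy: } I(X; T) \le \epsilon_U.
\end{aligned}
$$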

2. PIR-Inspired and Information-Theoretic Protocols

For quantized models, especially those with weights in $\{\pm 1\}$ or a small finite set, schemes inspired by private information retrieval enable highly efficient, information-theoretically secure inference. Typical constructions include:

  • Coset Protocol: The server constructs a parity-check syndrome of the weight vector under a partition of coordinates, forming a coset; the user computes blockwise inner products, and the server reconstructs $\langle w, x \rangle$ from these. The critical parameters are $d = n - t$ bits of query leakage (publication cost), $\ell = t$ real-valued projections for user privacy, and the trade-off $I(W; Q) + \ell \ge n$ (Deng et al., 2023).
  • Random Key, Roots-of-Unity, and Joint Retrieval Variants: These extend PIR-like logic to alternative finite alphabets and simultaneous retrieval of multiple signals.

Such protocols require only a single round and local linear operations, with $O(n)$ computation and communication. They are optimal with respect to the privacy–accuracy trade-off and generalize straightforwardly to extensions (e.g., field-size adaptation, multi-output queries) (Deng et al., 2023).
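
A minimal numerical sketch of this style of protocol appears below, assuming the simplest coset structure: each block of the $\{\pm 1\}$ weight vector is masked by an independent secret sign, so the published query reveals $n - t$ bits of $w$ while the user returns only $\ell = t$ blockwise projections of $x$. Variable names such as `sigma`, `query`, and `reply` are illustrative and do not follow any particular paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 12, 3                      # n coordinates, t blocks; query leaks n - t bits of w
block = n // t

w = rng.choice([-1, 1], size=n)   # server's private {+1, -1} weight vector
x = rng.standard_normal(n)        # user's private input

# Server: flip each block of w by a secret random sign and publish the result.
# The published vector is a coset representative of w under blockwise sign flips.
sigma = rng.choice([-1, 1], size=t)                     # kept secret by the server
query = np.concatenate([sigma[j] * w[j*block:(j+1)*block] for j in range(t)])

# User: return only t = ell blockwise inner products of the query with x.
reply = np.array([query[j*block:(j+1)*block] @ x[j*block:(j+1)*block]
                  for j in range(t)])

# Server: undo the secret signs to recover the full inner product <w, x>.
recovered = float(sigma @ reply)
assert np.isclose(recovered, w @ x)
```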

3. Cryptographic Protocols and Efficient Secure Inference

When model weights are not public, classical cryptographic primitives—such as homomorphic encryption (HE), secure multi-party computation (MPC), or garbled circuits—are used to conceal ww and xx during inference. Recent work seeks to mitigate the severe communication and computation bottlenecks associated with these techniques:

  • MPC for Deep Neural Networks and Transformers: Additive secret sharing, offline Beaver triples for matrix products, and carefully quantized approximate activations (e.g., replacing GeLU/softmax with polynomials or LUTs) enable practical secure inference on large models (Li et al., 2022, Lu et al., 3 Aug 2025); a Beaver-triple multiplication is sketched after this list. For BERT-base, secret-shared quantized protocols achieve up to $22\times$ speedup over prior work by integrating layer-wise per-channel quantization, LUT-based nonlinearities, and secret sharing for all weights, guaranteeing weight privacy with accuracy within 4% of full-precision baselines (Lu et al., 3 Aug 2025).
  • Neural Architecture Adaptation: Protocol-aware network design (e.g., using binary/ternary weights, bitwise activations, bit-friendly pooling) dramatically reduces the cryptographic operation count. Automated neural architecture search (NAS) can optimize for both task accuracy and cryptographic cost by embedding cost models of underlying protocols directly into training objectives (Aggarwal et al., 2020).
  • Operator and Algorithmic Innovations: Special convolutional operator design (e.g., the X-operator, depthwise or grouped convolutions) and Winograd minimal filtering reduce the number of expensive secure multiplications in vision tasks, resulting in $5\times$–$30\times$ reductions in communication overhead (Ganesan et al., 2022).
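
To make the secret-sharing machinery concrete, the sketch below performs a single Beaver-triple multiplication of an additively shared activation and weight over a prime field. It is a textbook illustration of the primitive from which these frameworks assemble matrix products, not a reproduction of any cited system; the modulus and toy values are arbitrary.

```python
import secrets

P = 2**61 - 1                    # prime modulus for additive secret sharing

def share(v):
    """Split v into two additive shares: v = s0 + s1 (mod P)."""
    s0 = secrets.randbelow(P)
    return s0, (v - s0) % P

def reveal(s0, s1):
    """Reconstruct a shared value by adding both shares."""
    return (s0 + s1) % P

# Offline phase: a dealer (or an OT/HE sub-protocol) shares a triple c = a*b.
a, b = secrets.randbelow(P), secrets.randbelow(P)
a0, a1 = share(a); b0, b1 = share(b); c0, c1 = share(a * b % P)

# Online phase: multiply a secret-shared activation x and weight w.
x, w = 12345, 67890              # stand-ins for quantized integer values
x0, x1 = share(x); w0, w1 = share(w)

# Both parties open the masked differences e = x - a and f = w - b,
# which reveal nothing about x or w because a and b are uniformly random.
e = reveal((x0 - a0) % P, (x1 - a1) % P)
f = reveal((w0 - b0) % P, (w1 - b1) % P)

# Each party computes its share of x*w locally; only party 0 adds the e*f term.
z0 = (e * f + e * b0 + f * a0 + c0) % P
z1 = (e * b1 + f * a1 + c1) % P
assert reveal(z0, z1) == (x * w) % P
```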

4. Approximate Inference and Privacy Trade-Offs

Recent advances show that by relaxing exact reconstruction of $\langle w, x \rangle$, strict information-theoretic lower bounds on privacy leakage can be bypassed. Introducing controlled approximation (e.g., via random coset noise vectors) reduces $I(W; Q)$ at fixed communication cost and user leakage:

  • Approximate PIR Protocol: With error scaling as $O(\|x\|_2 \sqrt{h/n})$ (for $h$ errors injected into the codeword), server-privacy leakage can be reduced by $\log|\Gamma|$ bits, where $\Gamma$ is the set of randomizations (Deng et al., 2023). For subpolynomial user leakage and approximation error, this achieves near-optimal privacy for linear decoders.

This approach provides a mechanism to trade negligible output distortion for substantial gains in model privacy, particularly when only partial accuracy is required by downstream applications.
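
Continuing the coset-style sketch from Section 2, the snippet below illustrates this numerically: flipping $h$ randomly chosen signs of the published query before sending it enlarges the set of weight vectors consistent with the query (lowering $I(W; Q)$) at the cost of a bounded error in the recovered inner product. The parameter choices are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t, h = 1024, 32, 16           # deliberately corrupt h of the n query coordinates
block = n // t

w = rng.choice([-1, 1], size=n)
x = rng.standard_normal(n)
sigma = rng.choice([-1, 1], size=t)

# Exact coset-style query, then h random sign flips before publication.
query = np.concatenate([sigma[j] * w[j*block:(j+1)*block] for j in range(t)])
noisy_query = query.copy()
noisy_query[rng.choice(n, size=h, replace=False)] *= -1

reply = np.array([noisy_query[j*block:(j+1)*block] @ x[j*block:(j+1)*block]
                  for j in range(t)])
approx, exact = float(sigma @ reply), float(w @ x)

# The distortion is on the order of ||x||_2 * sqrt(h/n) for random inputs.
print(abs(approx - exact), np.linalg.norm(x) * np.sqrt(h / n))
```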

5. Partially Oblivious Inference and Controlled Weight Leakage

In the context of expensive HE-based protocols, allowing intentional leakage of a subset $\mathcal{L}$ of the model weights (so-called "partially oblivious inference") yields a tunable trade-off between inference speed and weight privacy:

  • Trade-Off Mechanism: With leakage fraction $\lambda = |\mathcal{L}|/|\mathcal{W}|$, expensive ciphertext–ciphertext operations are replaced by faster plaintext–ciphertext operations for the public weights. Empirically, disclosing up to 80% of the weights can reduce inference runtime by $4\times$ while keeping the adversary's advantage in model reconstruction negligible (≤0.1%) (Rizomiliotis et al., 2022).

Best practices involve validation-based calibration of $\lambda$ and adapting the leakage granularity to the model and ciphertext-packing structure, while tracking adversarial reconstruction ability as the core privacy metric.
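
A back-of-the-envelope cost model makes the dial explicit; the per-operation timings below are hypothetical placeholders, not measurements from the cited work.

```python
def estimated_mult_time(n_mults, lam, t_ct_ct=1.0, t_pt_ct=0.05):
    """Estimated total multiplication time when a fraction `lam` of the weights
    is disclosed: disclosed weights use cheaper plaintext-ciphertext multiplies,
    the rest stay ciphertext-ciphertext.  t_ct_ct and t_pt_ct are placeholders."""
    return n_mults * ((1 - lam) * t_ct_ct + lam * t_pt_ct)

# Sweeping the leakage fraction shows the runtime/privacy dial.
for lam in (0.0, 0.5, 0.8, 1.0):
    print(f"lambda = {lam:.1f}: relative cost {estimated_mult_time(1.0, lam):.2f}")
```

With these placeholder costs, $\lambda = 0.8$ yields roughly a $4\times$ reduction in multiplication time, consistent with the magnitude of the reported speedup.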

6. Practical Implementations and Experimental Results

State-of-the-art frameworks instantiate these protocols across domains and models:

| Framework / Approach | Model Support | Weight Sharing | Notable Efficiency |
|---|---|---|---|
| PIR-based (Deng et al., 2023) | Quantized (e.g., linear) | None (full privacy) | $O(n)$ comm./comp., single round |
| SOTERIA (Aggarwal et al., 2020) | ConvNets, ternary | Fully private (GC) | 2–3$\times$ lower comm. than XONN/GC |
| MPCFormer (Li et al., 2022) | BERT, Transformers | Fully private (MPC) | 5.3$\times$ faster, ~1% acc. loss |
| Quantized BERT (Lu et al., 3 Aug 2025) | BERT-base | Fully private (q-MPC) | 8–22$\times$ faster vs. prior SOTA |
| Partially Oblivious (Rizomiliotis et al., 2022) | CNN (CIFAR-10) | Fractional, tunable | 4$\times$ fewer CT$\times$CT ops at ≤0.1% adv. gain |

Empirical results demonstrate that:

  • End-to-end, quantized and protocol-aware models consistently achieve order-of-magnitude speedups over generic cryptographic baselines, with accuracy drops in the 0.5–4% range.
  • For sum-product networks, secure secret-sharing–based learning and inference scale to dozens of parties at low overhead and negligible (<0.2%) numerical error (Althaus et al., 2021).
  • For large transformers, secret-shared, quantized weight protocols achieve online times of 1.04–2.14 s (sequence length 8–32 tokens) on 12-layer BERT-base compared to 8–16.7 s for previous MPC approaches (Lu et al., 3 Aug 2025).

7. Open Problems and Research Directions

Current work leaves several active areas of research:

  • Multi-layer/private-nonlinearity protocols: Generalizing information-theoretic inner-product retrieval to nonlinear activations with privacy guarantees remains challenging (Deng et al., 2023).
  • Approximate retrieval with dynamic composition: Balancing error, privacy, and amortized leakage across multiple queries has not yet reached a definitive theoretical or practical resolution (Deng et al., 2023).
  • Scalable design for multi-party and federated inference: Extending protocol-objective co-design to new model classes and communication architectures.
  • Fine-grained, context-aware leakage control: Automating and provably quantifying acceptable weight or input disclosure for tractable real-world deployments (Rizomiliotis et al., 2022).

These open directions suggest that practical, weight-private inference in large-scale, high-value domains will continue to benefit from joint advances in cryptography, information theory, quantization, and ML model structure awareness.
