Weight-Private ML Inference Protocols
- Weight-private ML inference is a family of protocols that keep a server's confidential model weights hidden while the model processes user inputs.
- It leverages cryptographic techniques, PIR-inspired methods, and quantization to balance efficiency with robust privacy guarantees.
- Practical implementations demonstrate significant speedups and minimal accuracy loss, driving advances in privacy-preserving MLaaS.
Weight-private machine learning inference refers to a class of protocols in which inference is performed on a model privately held by a server, such that the model's weights remain hidden from the user and the user's data remain hidden from the server, except for the final model output. This concept is central to privacy-preserving machine learning-as-a-service (MLaaS), where strong confidentiality guarantees are required for both the model intellectual property and sensitive user inputs. Technical approaches span information-theoretic, cryptographic (homomorphic encryption, secure multi-party computation, garbled circuits), and private information retrieval–inspired protocols, often leveraging quantization or architectural adaptations to improve efficiency and scalability.
1. Formal Models and Privacy Guarantees
Weight-private inference is characterized by a two-party protocol between a server holding a model parameter vector $w$ (often quantized to a finite alphabet $\mathcal{A}$) and a user supplying an input $x$ to the model. The transcript of exchanged messages must guarantee:
- Weight (Model) Privacy: The user should learn nothing about the model parameters $w$ except what is inevitably disclosed by the model output $y$. This is quantified as $I(w; T \mid y) = 0$, where $T$ is the transcript.
- Input (User) Privacy: The server should learn nothing about the user input $x$ except as required for computing the output. For linear inner-product–based inference, privacy is parameterized by the dimension $d$ of the subspace of $x$ revealed via returned linear measurements; equivalently, the number of independent linear projections of $x$ the server observes (Deng et al., 2023).
The canonical protocol form consists of (i) a query sent from the server to the user encoding obfuscated model information, (ii) a user reply derived from $x$ and the query, and (iii) recovery of the final output $y$ by the server.
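This one-round message flow can be fixed as a small interface. The sketch below (all names illustrative, not from the cited papers) uses a deliberately *insecure* instantiation, in which the reply reveals $x$ outright, purely to pin down the three-message structure that the protocols in the following sections realize privately:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Query:
    payload: np.ndarray        # (i) server -> user: obfuscated model information

@dataclass
class Reply:
    payload: np.ndarray        # (ii) user -> server: derived from x and the query

def make_query(w: np.ndarray) -> Query:
    # A private scheme would encode obfuscated information about w here.
    return Query(payload=np.empty(0))

def answer(x: np.ndarray, q: Query) -> Reply:
    # Insecure placeholder: leaks x to the server verbatim.
    return Reply(payload=x)

def recover(w: np.ndarray, r: Reply) -> float:
    # (iii) Server recovers the output y = <w, x>.
    return float(w @ r.payload)

w = np.array([1.0, -1.0, 1.0])
x = np.array([0.5, 2.0, -1.0])
assert np.isclose(recover(w, answer(x, make_query(w))), float(w @ x))
```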
2. PIR-Inspired and Information-Theoretic Protocols
For quantized models, especially those with weights in $\{-1,+1\}$ or a small finite set, schemes inspired by private information retrieval (PIR) enable highly efficient, information-theoretically secure inference. Typical constructions include:
- Coset Protocol: The server constructs a parity-check syndrome of the weight vector under a partition of coordinates, identifying a coset of candidate weight patterns; the user computes blockwise inner products for the coset members, and the server reconstructs $\langle w, x \rangle$ from these (see the sketch below). The critical parameters are the number of bits of query leakage (publication cost), the number of real-valued projections revealed (user privacy), and the trade-off between the two (Deng et al., 2023).
- Random Key, Roots-of-Unity, and Joint Retrieval Variants: These extend PIR-like logic to alternative finite alphabets and simultaneous retrieval of multiple signals.
Such protocols achieve single-round inference using only local linear operations, with $O(n)$ computation and communication. They are optimal with respect to the privacy–accuracy trade-off, and generalize straightforwardly to extensions (e.g., field-size adaptation, multi-output queries) (Deng et al., 2023).
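A toy instantiation of the coset idea for $\{-1,+1\}$ weights, assuming a single-parity-bit syndrome per block (the block size, variable names, and specific parity check are illustrative simplifications, not the construction of Deng et al., 2023):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def coset_inner_product(w_bits: np.ndarray, x: np.ndarray, block: int = 4) -> float:
    """Toy one-round coset protocol for weights in {-1,+1}, encoded as bits."""
    n = len(w_bits)
    assert n % block == 0
    total = 0.0
    for start in range(0, n, block):
        wb, xb = w_bits[start:start + block], x[start:start + block]
        # (i) Server -> user: one parity (syndrome) bit per block, which
        #     identifies a coset of 2^(block-1) candidate bit patterns.
        parity = int(wb.sum() % 2)
        coset = [p for p in itertools.product((0, 1), repeat=block)
                 if sum(p) % 2 == parity]
        # (ii) User -> server: inner products of the input block with every
        #      candidate pattern (mapped to +/-1); the user never learns
        #      which candidate is the server's true weight block.
        replies = [float(xb @ (2 * np.array(p) - 1)) for p in coset]
        # (iii) Server: select the reply for its true pattern and accumulate.
        total += replies[coset.index(tuple(int(b) for b in wb))]
    return total

n = 8
w_bits = rng.integers(0, 2, size=n)       # weight bits; w = 2*w_bits - 1
x = rng.standard_normal(n)
assert np.isclose(coset_inner_product(w_bits, x), float((2 * w_bits - 1) @ x))
```

Per block, the server publishes one syndrome bit (query leakage) and the user returns $2^{\text{block}-1}$ real-valued projections (user-privacy cost), making the trade-off between the two parameters explicit.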
3. Cryptographic Protocols and Efficient Secure Inference
When model weights are not public, classical cryptographic primitives—such as homomorphic encryption (HE), secure multi-party computation (MPC), or garbled circuits—are used to conceal $w$ and $x$ during inference. Recent work seeks to mitigate the severe communication and computation bottlenecks associated with these techniques:
- MPC for Deep Neural Networks and Transformers: Additive secret sharing, offline Beaver triples for matrix products, and carefully quantized approximate activations (e.g., replacing GeLU/softmax with polynomials or LUTs) enable practical secure inference on large models (Li et al., 2022, Lu et al., 3 Aug 2025); a toy Beaver-triple multiplication is sketched after this list. For BERT-base, secret-shared quantized protocols achieve up to 22× speedup versus prior work by integrating layer-wise per-channel quantization, LUT-based nonlinearities, and secret sharing for all weights—guaranteeing weight privacy with accuracy within 4% of full-precision baselines (Lu et al., 3 Aug 2025).
- Neural Architecture Adaptation: Protocol-aware network design (e.g., using binary/ternary weights, bitwise activations, bit-friendly pooling) dramatically reduces the cryptographic operation count. Automated neural architecture search (NAS) can optimize for both task accuracy and cryptographic cost by embedding cost models of underlying protocols directly into training objectives (Aggarwal et al., 2020).
- Operator and Algorithmic Innovations: Specialized convolutional operator designs (e.g., the X-operator, depthwise or grouped convolutions) and Winograd minimal filtering reduce the number of expensive secure multiplications in vision tasks, yielding substantial reductions in communication overhead (Ganesan et al., 2022).
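A minimal two-party sketch of the Beaver-triple multiplication underlying such MPC matrix products, with a trusted dealer standing in for the offline phase (real systems generate triples via OT or HE; the modulus and names are illustrative):

```python
import numpy as np

P = 2**61 - 1          # prime modulus for additive secret sharing
rng = np.random.default_rng(1)

def share(v: int):
    """Split v into two additive shares mod P."""
    r = int(rng.integers(0, P))
    return r, (v - r) % P

def beaver_mul(x_sh, y_sh, triple):
    """Multiply secret-shared x and y using a preprocessed triple (a, b, c=ab)."""
    (a0, a1), (b0, b1), (c0, c1) = triple
    # Each party locally masks its shares; the masked values are then opened.
    e = (x_sh[0] - a0 + x_sh[1] - a1) % P    # e = x - a (public)
    f = (y_sh[0] - b0 + y_sh[1] - b1) % P    # f = y - b (public)
    # xy = ef + eb + fa + ab; party 0 absorbs the public e*f term.
    z0 = (e * f + e * b0 + f * a0 + c0) % P
    z1 = (e * b1 + f * a1 + c1) % P
    return z0, z1

# Offline phase: a dealer (or OT/HE protocol) produces the correlated triple.
a, b = int(rng.integers(0, P)), int(rng.integers(0, P))
triple = (share(a), share(b), share(a * b % P))

x_sh, y_sh = share(7), share(9)
z0, z1 = beaver_mul(x_sh, y_sh, triple)
assert (z0 + z1) % P == 63
```

In a full protocol, opening `e` and `f` leaks nothing because `a` and `b` are uniformly random one-time masks; matrix products amortize one triple per secure multiplication.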
4. Approximate Inference and Privacy Trade-Offs
Recent advances show that by relaxing exact reconstruction of $\langle w, x \rangle$, strict information-theoretic lower bounds on privacy leakage can be bypassed. Introducing controlled approximation (e.g., via random coset noise vectors) reduces weight leakage at fixed communication cost and user leakage:
- Approximate PIR Protocol: With approximation error scaling in the number of errors injected into the codeword, server-privacy leakage can be reduced by $\log_2 |\mathcal{R}|$ bits, where $\mathcal{R}$ is the set of randomizations (Deng et al., 2023). For subpolynomial user leakage and approximation error, this achieves near-optimal privacy for linear decoders.
This approach provides a mechanism to trade negligible output distortion for substantial gains in model privacy, particularly when only partial accuracy is required by downstream applications.
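As a toy illustration of this trade-off (not the construction of Deng et al., 2023): randomly flipping a small fraction of $\{-1,+1\}$ weights before running an exact protocol bounds the output error while enlarging the set of weight vectors consistent with the transcript:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def approximate_private_inference(w, x, flip_fraction=0.05):
    """Toy approximate variant: run an exact weight-private protocol on a
    randomized w' with a few signs flipped. The output error is bounded,
    and the transcript is consistent with every w reachable by such a flip."""
    n = len(w)
    k = max(1, int(flip_fraction * n))
    idx = rng.choice(n, size=k, replace=False)
    w_noisy = w.copy()
    w_noisy[idx] *= -1                           # random coset-style noise
    y_approx = float(w_noisy @ x)                # stand-in for the exact protocol
    err_bound = 2.0 * np.abs(x[idx]).sum()       # |y - y'| <= 2 * sum_{i in idx} |x_i|
    leak_gain_bits = math.log2(math.comb(n, k))  # |R| = C(n, k) randomizations
    return y_approx, err_bound, leak_gain_bits

w = rng.choice([-1.0, 1.0], size=64)
x = rng.standard_normal(64)
y_approx, err_bound, gain = approximate_private_inference(w, x)
assert abs(y_approx - float(w @ x)) <= err_bound + 1e-9
```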
5. Partially Oblivious Inference and Controlled Weight Leakage
In the context of expensive HE-based protocols, allowing intentional leakage of a subset of model weights (so-called "partially oblivious inference") yields a tunable trade-off between inference speed and weight privacy:
- Trade-Off Mechanism: With a tunable leakage fraction, expensive ciphertext-ciphertext operations are replaced by faster plaintext-ciphertext operations for the public weights. Empirically, disclosing up to 80% of the weights can reduce inference runtime by 4× while incurring only negligible (≤0.1%) adversarial advantage in model reconstruction (Rizomiliotis et al., 2022).
Best practices involve validation-based calibration of the leakage fraction and adapting leakage granularity to the model and ciphertext-packing structure, while tracking adversarial reconstruction ability as the core privacy metric.
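A back-of-envelope cost model for choosing the leakage fraction (unit costs are made up for illustration; in practice they are measured for the HE scheme and packing in use):

```python
def runtime_estimate(n_weights: int, leak_fraction: float,
                     t_pt_ct: float = 1.0, t_ct_ct: float = 10.0) -> float:
    """Illustrative runtime model: leaked (public) weights use cheap
    plaintext-ciphertext multiplications, hidden weights use expensive
    ciphertext-ciphertext ones."""
    n_public = int(leak_fraction * n_weights)
    return n_public * t_pt_ct + (n_weights - n_public) * t_ct_ct

# Disclosing 80% of weights with a 10x ct-ct/pt-ct cost gap cuts runtime ~3.6x.
speedup = runtime_estimate(10_000, 0.0) / runtime_estimate(10_000, 0.8)
print(f"speedup at 80% leakage: {speedup:.1f}x")
```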
6. Practical Implementations and Experimental Results
State-of-the-art frameworks instantiate these protocols across domains and models:
| Framework / Approach | Model Support | Weight Privacy | Notable Efficiency |
|---|---|---|---|
| PIR-based (Deng et al., 2023) | Quantized (e.g., linear) | Full (information-theoretic) | O(n) comm./comp., single round |
| SOTERIA (Aggarwal et al., 2020) | ConvNets, ternary weights | Full (garbled circuits) | 2–3× lower comm. than XONN/GC |
| MPCFormer (Li et al., 2022) | BERT, Transformers | Full (MPC) | 5.3× faster, ~1% acc. loss |
| Quantized BERT (Lu et al., 3 Aug 2025) | BERT-base | Full (quantized MPC) | 8–22× faster vs. prior SOTA |
| Partially Oblivious (Rizomiliotis et al., 2022) | CNN (CIFAR-10) | Partial, tunable | 4× fewer ct-ct ops at negligible adv. gain |
Empirical results demonstrate that:
- End-to-end, quantized and protocol-aware models consistently achieve order-of-magnitude speedups over generic cryptographic baselines, with accuracy drops in the 0.5–4% range.
- For sum-product networks, secure secret-sharing–based learning and inference scale to dozens of parties at low overhead and negligible (<0.2%) numerical error (Althaus et al., 2021).
- For large transformers, secret-shared, quantized weight protocols achieve online times of 1.04–2.14 s (sequence length 8–32 tokens) on 12-layer BERT-base compared to 8–16.7 s for previous MPC approaches (Lu et al., 3 Aug 2025).
7. Open Problems and Research Directions
Current work leaves several active areas of research:
- Multi-layer/private-nonlinearity protocols: Generalizing information-theoretic inner-product retrieval to nonlinear activations with privacy guarantees remains challenging (Deng et al., 2023).
- Approximate retrieval with dynamic composition: Balancing error, privacy, and amortized leakage over multiple queries remains unresolved both theoretically and practically (Deng et al., 2023).
- Scalable design for multi-party and federated inference: Extending protocol-objective co-design to new model classes and communication architectures.
- Fine-grained, context-aware leakage control: Automating and provably quantifying acceptable weight or input disclosure for tractable real-world deployments (Rizomiliotis et al., 2022).
These open directions suggest that practical, weight-private inference in large-scale, high-value domains will continue to benefit from joint advances in cryptography, information theory, quantization, and ML model structure awareness.