
Secure k-ish NN for Sensitive Queries

Updated 8 December 2025
  • The paper introduces a sensitive query classifier that uses homomorphic encryption and k-ish NN relaxation to enable secure, scalable query classification.
  • It employs a double-blinded coin-toss primitive to efficiently estimate statistical moments of the encrypted distances, enabling threshold-based neighbor selection without revealing sensitive data.
  • Experiments on the Wisconsin Breast Cancer Dataset show a slight accuracy trade-off (F1 ≈ 0.98) with significant gains in speed and communication efficiency.

A sensitive query classifier provides privacy-preserving classification for queries on proprietary datasets, where the client wishes to classify a query point against a database held by a server, without either party exposing their respective data. The Secure k-ish Nearest Neighbors ("k-ish NN") classifier (Shaul et al., 2018) achieves this using homomorphic encryption (HE) and algorithmic relaxations that maintain accuracy while enabling highly scalable, parallel, and communication-efficient deployment.

1. Problem Formulation and Security Constraints

Consider a server holding a database $S = \{\, x_i \in \mathbb{F}_p^d \mid i = 1, \ldots, n \,\}$ with binary class labels $\mathit{class}(x_i) \in \{0, 1\}$, and a client holding a query point $q \in \mathbb{F}_p^d$. The conventional $k$NN classifier assigns $q$ the majority label among the $k$ nearest points in $S$:

$$\mathrm{class}_{k\mathrm{NN}}(q) = \mathrm{maj}\,\bigl\{\, \mathit{class}(x_i) \;\big|\; \mathrm{dist}(q, x_i) \text{ is among the } k \text{ smallest} \,\bigr\}$$

The sensitive-query scenario mandates that (a) the client learns only the classification result and nothing about $S$, and (b) the server learns nothing about $q$ nor about any intermediate decrypted values. These properties are enforced via an additively homomorphic or leveled fully homomorphic encryption scheme, which provides IND-CPA security and the necessary operations on encrypted data.
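
For reference, the plaintext decision rule that the secure protocol emulates can be sketched in a few lines (a minimal Python sketch; the Euclidean metric and the names are illustrative, not from the paper):

import numpy as np

def knn_classify(q, X, labels, k):
    """Plaintext kNN majority vote that secure k-ish NN approximates."""
    dists = np.linalg.norm(X - q, axis=1)       # dist(q, x_i) for every i
    nearest = np.argsort(dists)[:k]             # indices of the k smallest distances
    return int(labels[nearest].sum() * 2 > k)   # majority of the binary labels

Everything below replaces the sorting step, which is prohibitively expensive under HE, with threshold-based selection.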

2. k-ish Nearest Neighbors Relaxation

The core methodological innovation is the relaxation of exact $k$-nearest neighbors to a probabilistic "k-ish" selection. Instead of always returning the majority over the strict $k$ nearest, the classifier computes a random $\kappa$ such that

$$\Pr\bigl[\kappa \in [(1-\delta)k,\ (1+\delta)k]\bigr] \ge 1-\varepsilon$$

for tunably small $\delta, \varepsilon > 0$. The empirical distance distribution $D_{S,q} = \{\, d_i = \|q - x_i\| \,\}$ governs the statistical properties underlying the choice of threshold. If $D_{S,q}$ is approximately Gaussian, set

$$T = \mu + \Phi^{-1}(k/n)\,\sigma$$

where $\mu, \sigma$ are the mean and standard deviation of $D_{S,q}$ and $\Phi^{-1}$ is the inverse CDF of the standard normal distribution. The expected number of points with $d_i < T$ is then $k$. Since each point falls below $T$ independently with probability $p = \Phi((T-\mu)/\sigma)$, the count is binomially distributed:

$$\Pr[\kappa = k'] \approx \Pr\bigl(\#\{\, i \mid d_i < T \,\} = k'\bigr) = \binom{n}{k'}\, p^{k'}\, (1-p)^{n-k'}$$

The deviation probability is bounded:

$$\Pr\bigl[\,|\kappa - k| > \delta k\,\bigr] \le 2\exp\bigl(-\Omega(k\delta^2)\bigr) + O(nsT)$$

where $s$ is the statistical distance between the empirical and Gaussian models.
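
The mechanism is easy to check numerically (an illustrative NumPy/SciPy sketch, not code from the paper): draw Gaussian-like distances, set the threshold $T$ from the sample moments, and observe that the count below $T$ concentrates around $k$.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, k = 569, 15
sample = rng.normal(10.0, 2.0, size=n)              # stand-in distance sample
T = sample.mean() + norm.ppf(k / n) * sample.std()  # T = mu + Phi^{-1}(k/n) * sigma
kappas = [(rng.normal(10.0, 2.0, size=n) < T).sum() for _ in range(1000)]
print(np.mean(kappas), np.std(kappas))              # mean near k, modest spread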

3. Double-Blinded Coin-Toss Primitive

Efficient estimation of moments (the mean and variance of the distances) under HE is enabled by a "double-blinded coin-toss" primitive. Given a ciphertext $\Enc(p)$ and a modulus $m$, a coin is tossed that comes up heads with probability $p/m$, without ever revealing either the probability $p$ or the coin's outcome. The pseudocode is:

// Client: pk; Server: P = Enc(p), plaintext modulus m
draw r ∈ {0, …, m−1} uniformly at random    // fresh randomness per toss
C ← isSmallerHE(r, P)                       // returns Enc([r < p]); heads with Pr = p/m
return C                                    // encrypted coin; the outcome is never decrypted

Here, isSmallerHE evaluates a comparison polynomial of degree $O(p)$ (multiplicative depth $O(\log p)$; cf. the circuit-size analysis in Section 6) and returns a homomorphically encrypted bit. To estimate the mean $\mu = \tfrac{1}{n}\sum_i d_i$, toss $n$ coins, where coin $i$ has heads probability $d_i/(n\,d_{\max})$, sum the encrypted results, and renormalize. Similarly, for $\mu_2 = \tfrac{1}{n}\sum_i d_i^2$, use probabilities $d_i^2/(n\,d_{\max}^2)$. The variance estimate is

$$\hat{\sigma}^2 = \widehat{\mu_2} - (\hat{\mu})^2$$

computed entirely in encrypted space via HE addition and multiplication.
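
The estimator's behavior can be checked in plaintext (an illustrative Python simulation of the coin tosses; in the real protocol every quantity stays encrypted and each toss is the double-blinded primitive above):

import numpy as np

rng = np.random.default_rng(1)
d = rng.uniform(1.0, 20.0, size=500)       # plaintext stand-in distances
d_max, n, reps = d.max(), d.size, 20000    # reps: repeated tosses sharpen the estimate

# Coin i is heads with probability d_i / (n * d_max), so the expected number of
# heads per round is mu / d_max; renormalizing recovers the mean.
mu_hat = (rng.random((reps, n)) < d / (n * d_max)).sum(axis=1).mean() * d_max
mu2_hat = (rng.random((reps, n)) < d**2 / (n * d_max**2)).sum(axis=1).mean() * d_max**2
print(mu_hat, d.mean())                    # estimated vs. true mean
print(mu2_hat - mu_hat**2, d.var())        # variance estimate sigma^2 = mu_2 - mu^2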

4. Homomorphic Encryption Circuit Architecture

The classification is realized as follows:

  1. Input preparation: Client supplies $\{ \Enc(q_j) \}_{j=1}^d$; the server uses its cleartext $x_i$.
  2. Distance calculation: $\Enc(d_i) = \| \Enc(q) - x_i \|$ via HE polynomials for the chosen metric ($\ell_1$ or squared $\ell_2$).
  3. Moment estimation: Parallel double-blinded coin-tosses yield $\Enc(\hat{\mu}), \Enc(\widehat{\mu_2})$.
  4. Threshold derivation: Compute $\Enc(T) = \Enc(\hat{\mu}) + \Phi^{-1}(k/n) \cdot \Enc(\hat{\sigma})$.
  5. Majority vote: For each $i$, compute $\mathrm{isSmallerHE}(d_i, T)$ and use it in encrypted tallies of the two class labels:

$$\Enc(c_1) = \sum_i \mathrm{isSmallerHE}(d_i, T) \cdot \Enc(\mathit{class}(x_i)), \qquad \Enc(c_0) = \sum_i \mathrm{isSmallerHE}(d_i, T) \cdot \bigl(1 - \mathit{class}(x_i)\bigr)$$

The overall encrypted majority is $\Enc(\mathrm{class}_q) = \mathrm{isSmallerHE}(c_0, c_1)$.

  6. Output: The server forwards the encrypted classifier output to the client, who decrypts it. (A plaintext sketch of the full pipeline follows.)
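
Putting the six steps together (an illustrative plaintext simulation of the encrypted pipeline; HE operations are replaced by their cleartext counterparts, and all names are ours, not the paper's):

import numpy as np
from scipy.stats import norm

def kish_nn_classify(q, X, labels, k):
    d = np.abs(X - q).sum(axis=1)                   # step 2: l1 distances
    T = d.mean() + norm.ppf(k / len(d)) * d.std()   # steps 3-4: moments and threshold
    below = d < T                                   # step 5: isSmaller comparisons
    c1 = (below & (labels == 1)).sum()              # tally for class 1
    c0 = (below & (labels == 0)).sum()              # tally for class 0
    return int(c0 < c1)                             # step 6: majority bit

rng = np.random.default_rng(2)
X = rng.integers(0, 250, size=(569, 2))             # toy data on a 250 x 250 grid
labels = (X.sum(axis=1) > 250).astype(int)
print(kish_nn_classify(np.array([40, 60]), X, labels, k=15))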

All modules operate in parallel across $i$, giving a circuit depth independent of the database size $n$:

$$O\bigl(\mathrm{depth}(\mathrm{dist}) + \mathrm{depth}(\mathrm{isSmaller}) + \log p\bigr)$$

Each operation has depth $O(\log p)$, enabling scalable circuit composition. The plaintext modulus $p$ is chosen large enough to avoid wrap-around on the quantized data, yet small enough that the degree-$O(p)$ comparison polynomials remain tractable (roughly $O(\sqrt{p})$ ciphertext multiplications each). Optimizations include distance quantization (8–12 bits), slot-packing for batched operations, and precomputed polynomial coefficients for comparison.
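
To make the comparison module concrete, here is one standard way an encrypted less-than test can be expressed as polynomial arithmetic over $\mathbb{F}_p$, using the Fermat indicator $1 - x^{p-1}$ (a toy plaintext sketch; the paper's actual interpolation polynomial may differ):

p = 17                                    # toy plaintext modulus

def is_zero(x):                           # 1 - x^(p-1) mod p: equals 1 iff x == 0 (Fermat)
    return (1 - pow(x, p - 1, p)) % p

def is_smaller(a, b):                     # [a < b] for a, b in {0, ..., (p-1)//2}
    diff = (a - b) % p                    # negative differences wrap into the top half
    return sum(is_zero((diff - c) % p) for c in range((p + 1) // 2, p)) % p

assert is_smaller(3, 5) == 1 and is_smaller(5, 3) == 0 and is_smaller(4, 4) == 0

Evaluated homomorphically, this is a degree-$(p-1)$ polynomial in the difference, which is why $p$ must stay small and why evaluation strategies costing about $\sqrt{p}$ ciphertext multiplications matter.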

5. Security and Correctness Guarantees

The protocol operates under a semi-honest (honest-but-curious) adversarial model. Homomorphic encryption ensures that the client’s query remains hidden and the server’s database is protected, revealing only the final class label. The protocol involves one message from client to server containing encrypted query coordinates, and a return message with the encrypted classification result.

Security is formalized by simulation arguments: the server's view (public key, encrypted query and output, its own database) is simulatable from random ciphertexts under HE IND-CPA security, and the client's view is limited to its input and the encrypted label. Correctness of the moment estimation leverages Chernoff bounds:

$$\Pr\bigl[\,|\hat{\mu} - \mu| > \delta\mu\,\bigr] < 2\exp\Bigl(-\frac{\mu\delta^2}{3}\Bigr)$$

$$\Pr\bigl[\,|\hat{\sigma} - \sigma| > \delta\sigma\,\bigr] < 2\exp\bigl(-\Omega(\sigma^2\delta^2)\bigr)$$

These bounds concentrate the random coin-toss outcomes around their expectations. Combined with the statistical-distance terms, the probability that the selected $\kappa$ strays far from $k$ is exponentially suppressed.
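
For a sense of scale (illustrative numbers, not from the paper): if the expected head count across the encrypted coin tosses is $\mu = 1000$ and $\delta = 0.1$, the first bound gives

$$\Pr\bigl[\,|\hat{\mu} - \mu| > 0.1\,\mu\,\bigr] < 2\exp(-1000 \cdot 0.01 / 3) \approx 0.07,$$

so on the order of a thousand tosses already pins the mean down to within ten percent with high confidence.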

6. Performance Evaluation and Practical Considerations

On the Wisconsin Breast Cancer Dataset (569 points, binary labels), plaintext kNN yields $F_1 \approx 0.985$; Secure k-ish NN on a $250 \times 250$ grid achieves $F_1 \approx 0.98$. The classifier thus loses under one percentage point of $F_1$, a loss compensated by a substantial reduction in computation time: Secure k-ish NN executes in under three hours on 16 cores with HElib/BGV, whereas naive secure kNN (based on HE sorting) would require weeks.

Communication is minimized: the client sends $d$ ciphertexts and the server responds with one or two ciphertexts, so communication scales as $O(d)$, independent of $n$. The circuit has $O(n\sqrt{p})$ gates and depth $O(\log p)$, supporting high parallelism.

Practical implementation tips include using the BGV scheme with $p \approx 300$–$500$, leveraging slot-packing, precomputing coefficients for the comparison and coin-toss modules, and quantizing distances before encryption.
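
As one example of the last tip, distances can be quantized to a small bit-width before encryption so they fit the plaintext modulus (an illustrative sketch; the bit-width and scaling policy are ours, not the paper's):

import numpy as np

def quantize_distances(dists, bits=8):
    """Map real-valued distances to integers in {0, ..., 2^bits - 1}."""
    scale = (2**bits - 1) / dists.max()    # largest distance maps to 2^bits - 1
    return np.rint(dists * scale).astype(int), scale

q, s = quantize_distances(np.array([0.3, 4.2, 7.9]))
print(q)                                   # -> [ 10 136 255]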

7. Conceptual Significance and Implications

The k-ish NN classifier demonstrates that relaxing the strict nearest-neighbor count to an approximate probabilistic variant fundamentally transforms the scalability of secure classification under homomorphic encryption, by replacing expensive sorting with parallelized coin-toss and comparison modules. The result is a one-round protocol supporting efficient, privacy-preserving analytics with only minimal accuracy loss. This suggests broader scope for algorithmic relaxations in the development of practical cryptographic machine learning tools in sensitive-query contexts (Shaul et al., 2018).

References

Shaul, H., Feldman, D., & Rus, D. (2018). Secure k-ish Nearest Neighbors Classifier. arXiv preprint.
