
Secure k-ish NN for Sensitive Queries

Updated 8 December 2025
  • The paper introduces a sensitive query classifier that uses homomorphic encryption and k-ish NN relaxation to enable secure, scalable query classification.
  • It employs a double-blinded coin-toss primitive to efficiently estimate statistical moments of the encrypted distances, enabling threshold-based neighbor selection without revealing sensitive data.
  • Experiments on the Wisconsin Breast Cancer Dataset show a slight accuracy trade-off (F1 ≈ 0.98) with significant gains in speed and communication efficiency.

A sensitive query classifier provides privacy-preserving classification for queries on proprietary datasets, where the client wishes to classify a query point against a database held by a server, without either party exposing their respective data. The Secure k-ish Nearest Neighbors ("k-ish NN") classifier (Shaul et al., 2018) achieves this using homomorphic encryption (HE) and algorithmic relaxations that maintain accuracy while enabling highly scalable, parallel, and communication-efficient deployment.

1. Problem Formulation and Security Constraints

Consider a server holding a database $S = \{\, x_i \in \mathbb{F}_p^d \mid i = 1, \ldots, n \,\}$ with binary class labels $\mathit{class}(x_i) \in \{0, 1\}$, and a client holding a query point $q \in \mathbb{F}_p^d$. The conventional $k$NN classifier assigns $q$ the majority label among the $k$ nearest points in $S$:

$$\mathrm{class}_{k\mathrm{NN}}(q) = \mathrm{maj}\,\bigl\{\, \mathit{class}(x_i) \;\big|\; \mathrm{dist}(q, x_i) \text{ is among the } k \text{ smallest} \,\bigr\}$$

The sensitive-query scenario mandates that (a) the client learns only the classification result and nothing about $S$, and (b) the server learns nothing about $q$ nor about any intermediate decrypted values. These properties are enforced via an additively homomorphic or leveled fully homomorphic encryption scheme, which provides IND-CPA security and the necessary operations on encrypted data.
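
For reference, the plaintext decision rule that the secure protocol emulates can be sketched in a few lines (a minimal Python sketch; the Euclidean metric and the names are illustrative, not from the paper):

import numpy as np

def knn_classify(q, X, labels, k):
    """Plaintext kNN majority vote that secure k-ish NN approximates."""
    dists = np.linalg.norm(X - q, axis=1)       # dist(q, x_i) for every i
    nearest = np.argsort(dists)[:k]             # indices of the k smallest distances
    return int(labels[nearest].sum() * 2 > k)   # majority of the binary labels

Everything below replaces the sorting step, which is prohibitively expensive under HE, with threshold-based selection.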

2. k-ish Nearest Neighbors Relaxation

The core methodological innovation is the relaxation of exact $k$-nearest neighbors to a probabilistic "k-ish" selection. Instead of always returning the majority over the strict $k$ nearest, the classifier computes a random $\kappa$ such that

$$\Pr\bigl[\kappa \in [(1-\delta)k,\ (1+\delta)k]\bigr] \ge 1-\varepsilon$$

for tunably small $\delta, \varepsilon > 0$. The empirical distance distribution $D_{S,q} = \{\, d_i = \|q - x_i\| \,\}$ governs the statistical properties underlying the choice of threshold. If $D_{S,q}$ is approximately Gaussian, set

$$T = \mu + \Phi^{-1}(k/n)\,\sigma$$

where $\mu, \sigma$ are the mean and standard deviation of $D_{S,q}$ and $\Phi^{-1}$ is the inverse CDF of the standard normal distribution. The expected number of points with $d_i < T$ is then $k$. Since each point falls below $T$ independently with probability $p = \Phi((T-\mu)/\sigma)$, the count is binomially distributed:

$$\Pr[\kappa = k'] \approx \Pr\bigl(\#\{\, i \mid d_i < T \,\} = k'\bigr) = \binom{n}{k'}\, p^{k'}\, (1-p)^{n-k'}$$

The deviation probability is bounded:

$$\Pr\bigl[\,|\kappa - k| > \delta k\,\bigr] \le 2\exp\bigl(-\Omega(k\delta^2)\bigr) + O(nsT)$$

where $s$ is the statistical distance between the empirical and Gaussian models.
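
The mechanism is easy to check numerically (an illustrative NumPy/SciPy sketch, not code from the paper): draw Gaussian-like distances, set the threshold $T$ from the sample moments, and observe that the count below $T$ concentrates around $k$.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, k = 569, 15
sample = rng.normal(10.0, 2.0, size=n)              # stand-in distance sample
T = sample.mean() + norm.ppf(k / n) * sample.std()  # T = mu + Phi^{-1}(k/n) * sigma
kappas = [(rng.normal(10.0, 2.0, size=n) < T).sum() for _ in range(1000)]
print(np.mean(kappas), np.std(kappas))              # mean near k, modest spread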

3. Double-Blinded Coin-Toss Primitive

Efficient estimation of moments (the mean and variance of the distances) under HE is enabled by a "double-blinded coin-toss" primitive. Given a ciphertext $\Enc(p)$ and a modulus $m$, a coin is tossed that comes up heads with probability $p/m$, without ever revealing either the probability $p$ or the coin's outcome. The pseudocode is:

// Client: pk; Server: P = Enc(p), plaintext modulus m
draw r ∈ {0, …, m−1} uniformly at random    // fresh randomness per toss
C ← isSmallerHE(r, P)                       // returns Enc([r < p]); heads with Pr = p/m
return C                                    // encrypted coin; the outcome is never decrypted

Here, isSmallerHE evaluates a comparison polynomial of degree $O(p)$ (multiplicative depth $O(\log p)$; cf. the circuit-size analysis in Section 6) and returns a homomorphically encrypted bit. To estimate the mean $\mu = \tfrac{1}{n}\sum_i d_i$, toss $n$ coins, where coin $i$ has heads probability $d_i/(n\,d_{\max})$, sum the encrypted results, and renormalize. Similarly, for $\mu_2 = \tfrac{1}{n}\sum_i d_i^2$, use probabilities $d_i^2/(n\,d_{\max}^2)$. The variance estimate is

$$\hat{\sigma}^2 = \widehat{\mu_2} - (\hat{\mu})^2$$

computed entirely in encrypted space via HE addition and multiplication.
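
The estimator's behavior can be checked in plaintext (an illustrative Python simulation of the coin tosses; in the real protocol every quantity stays encrypted and each toss is the double-blinded primitive above):

import numpy as np

rng = np.random.default_rng(1)
d = rng.uniform(1.0, 20.0, size=500)       # plaintext stand-in distances
d_max, n, reps = d.max(), d.size, 20000    # reps: repeated tosses sharpen the estimate

# Coin i is heads with probability d_i / (n * d_max), so the expected number of
# heads per round is mu / d_max; renormalizing recovers the mean.
mu_hat = (rng.random((reps, n)) < d / (n * d_max)).sum(axis=1).mean() * d_max
mu2_hat = (rng.random((reps, n)) < d**2 / (n * d_max**2)).sum(axis=1).mean() * d_max**2
print(mu_hat, d.mean())                    # estimated vs. true mean
print(mu2_hat - mu_hat**2, d.var())        # variance estimate sigma^2 = mu_2 - mu^2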

4. Homomorphic Encryption Circuit Architecture

The classification is realized as follows:

  1. Input preparation: Client supplies $\{ \Enc(q_j) \}_{j=1}^d$; the server uses its cleartext $x_i$.
  2. Distance calculation: $\Enc(d_i) = \| \Enc(q) - x_i \|$ via HE polynomials for the chosen metric ($\ell_1$ or squared $\ell_2$).
  3. Moment estimation: Parallel double-blinded coin-tosses yield $\Enc(\hat{\mu}), \Enc(\widehat{\mu_2})$.
  4. Threshold derivation: Compute $\Enc(T) = \Enc(\hat{\mu}) + \Phi^{-1}(k/n) \cdot \Enc(\hat{\sigma})$.
  5. Majority vote: For each $i$, compute $\mathrm{isSmallerHE}(d_i, T)$ and use it in encrypted tallies of the two class labels:

$$\Enc(c_1) = \sum_i \mathrm{isSmallerHE}(d_i, T) \cdot \Enc(\mathit{class}(x_i)), \qquad \Enc(c_0) = \sum_i \mathrm{isSmallerHE}(d_i, T) \cdot \bigl(1 - \mathit{class}(x_i)\bigr)$$

The overall encrypted majority is $\Enc(\mathrm{class}_q) = \mathrm{isSmallerHE}(c_0, c_1)$.

  6. Output: The server forwards the encrypted classifier output to the client, who decrypts it. (A plaintext sketch of the full pipeline follows.)
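
Putting the six steps together (an illustrative plaintext simulation of the encrypted pipeline; HE operations are replaced by their cleartext counterparts, and all names are ours, not the paper's):

import numpy as np
from scipy.stats import norm

def kish_nn_classify(q, X, labels, k):
    d = np.abs(X - q).sum(axis=1)                   # step 2: l1 distances
    T = d.mean() + norm.ppf(k / len(d)) * d.std()   # steps 3-4: moments and threshold
    below = d < T                                   # step 5: isSmaller comparisons
    c1 = (below & (labels == 1)).sum()              # tally for class 1
    c0 = (below & (labels == 0)).sum()              # tally for class 0
    return int(c0 < c1)                             # step 6: majority bit

rng = np.random.default_rng(2)
X = rng.integers(0, 250, size=(569, 2))             # toy data on a 250 x 250 grid
labels = (X.sum(axis=1) > 250).astype(int)
print(kish_nn_classify(np.array([40, 60]), X, labels, k=15))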

All modules operate in parallel across $i$, giving a circuit depth independent of the database size $n$:

$$O\bigl(\mathrm{depth}(\mathrm{dist}) + \mathrm{depth}(\mathrm{isSmaller}) + \log p\bigr)$$

Each operation has depth $O(\log p)$, enabling scalable circuit composition. The plaintext modulus $p$ is chosen large enough to avoid wrap-around on the quantized data, yet small enough that the degree-$O(p)$ comparison polynomials remain tractable (roughly $O(\sqrt{p})$ ciphertext multiplications each). Optimizations include distance quantization (8–12 bits), slot-packing for batched operations, and precomputed polynomial coefficients for comparison.
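
To make the comparison module concrete, here is one standard way an encrypted less-than test can be expressed as polynomial arithmetic over $\mathbb{F}_p$, using the Fermat indicator $1 - x^{p-1}$ (a toy plaintext sketch; the paper's actual interpolation polynomial may differ):

p = 17                                    # toy plaintext modulus

def is_zero(x):                           # 1 - x^(p-1) mod p: equals 1 iff x == 0 (Fermat)
    return (1 - pow(x, p - 1, p)) % p

def is_smaller(a, b):                     # [a < b] for a, b in {0, ..., (p-1)//2}
    diff = (a - b) % p                    # negative differences wrap into the top half
    return sum(is_zero((diff - c) % p) for c in range((p + 1) // 2, p)) % p

assert is_smaller(3, 5) == 1 and is_smaller(5, 3) == 0 and is_smaller(4, 4) == 0

Evaluated homomorphically, this is a degree-$(p-1)$ polynomial in the difference, which is why $p$ must stay small and why evaluation strategies costing about $\sqrt{p}$ ciphertext multiplications matter.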

5. Security and Correctness Guarantees

The protocol operates under a semi-honest (honest-but-curious) adversarial model. Homomorphic encryption ensures that the client’s query remains hidden and the server’s database is protected, revealing only the final class label. The protocol involves one message from client to server containing encrypted query coordinates, and a return message with the encrypted classification result.

Security is formalized by simulation arguments: the server's view (public key, encrypted query and output, its own database) is simulatable from random ciphertexts under HE IND-CPA security, and the client's view is limited to its input and the encrypted label. Correctness of the moment estimation leverages Chernoff bounds:

$$\Pr\bigl[\,|\hat{\mu} - \mu| > \delta\mu\,\bigr] < 2\exp\Bigl(-\frac{\mu\delta^2}{3}\Bigr)$$

$$\Pr\bigl[\,|\hat{\sigma} - \sigma| > \delta\sigma\,\bigr] < 2\exp\bigl(-\Omega(\sigma^2\delta^2)\bigr)$$

These bounds concentrate the random coin-toss outcomes around their expectations. Combined with the statistical-distance terms, the probability that the selected $\kappa$ strays far from $k$ is exponentially suppressed.
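
For a sense of scale (illustrative numbers, not from the paper): if the expected head count across the encrypted coin tosses is $\mu = 1000$ and $\delta = 0.1$, the first bound gives

$$\Pr\bigl[\,|\hat{\mu} - \mu| > 0.1\,\mu\,\bigr] < 2\exp(-1000 \cdot 0.01 / 3) \approx 0.07,$$

so on the order of a thousand tosses already pins the mean down to within ten percent with high confidence.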

6. Performance Evaluation and Practical Considerations

On the Wisconsin Breast Cancer Dataset (569 points, binary labels), plaintext kNN yields $F_1 \approx 0.985$; Secure k-ish NN on a $250 \times 250$ grid achieves $F_1 \approx 0.98$. The classifier thus loses under one percentage point of $F_1$, a loss compensated by a substantial reduction in computation time: Secure k-ish NN executes in under three hours on 16 cores with HElib/BGV, whereas naive secure kNN (based on HE sorting) would require weeks.

Communication is minimized: the client sends $d$ ciphertexts and the server responds with one or two ciphertexts, so communication scales as $O(d)$, independent of $n$. The circuit has $O(n\sqrt{p})$ gates and depth $O(\log p)$, supporting high parallelism.

Practical implementation tips include using the BGV scheme with $p \approx 300$–$500$, leveraging slot-packing, precomputing coefficients for the comparison and coin-toss modules, and quantizing distances before encryption.
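
As one example of the last tip, distances can be quantized to a small bit-width before encryption so they fit the plaintext modulus (an illustrative sketch; the bit-width and scaling policy are ours, not the paper's):

import numpy as np

def quantize_distances(dists, bits=8):
    """Map real-valued distances to integers in {0, ..., 2^bits - 1}."""
    scale = (2**bits - 1) / dists.max()    # largest distance maps to 2^bits - 1
    return np.rint(dists * scale).astype(int), scale

q, s = quantize_distances(np.array([0.3, 4.2, 7.9]))
print(q)                                   # -> [ 10 136 255]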

7. Conceptual Significance and Implications

The k-ish NN classifier demonstrates that relaxing the strict nearest-neighbor count to an approximate probabilistic variant fundamentally transforms the scalability of secure classification under homomorphic encryption, by replacing expensive sorting with parallelized coin-toss and comparison modules. The result is a one-round protocol supporting efficient, privacy-preserving analytics with only minimal accuracy loss. This suggests broader scope for algorithmic relaxations in the development of practical cryptographic machine learning tools in sensitive-query contexts (Shaul et al., 2018).

References

Shaul, H., Feldman, D., & Rus, D. (2018). Secure k-ish Nearest Neighbors Classifier. arXiv preprint.
