Secure k-ish NN for Sensitive Queries
- The paper introduces a sensitive query classifier that uses homomorphic encryption and k-ish NN relaxation to enable secure, scalable query classification.
- It employs a double-blinded coin-toss primitive to efficiently estimate statistical moments, facilitating encrypted distance computations without revealing sensitive data.
- Experiments on the Wisconsin Breast Cancer Dataset show a slight accuracy trade-off (F1 ≈ 0.98) with significant gains in speed and communication efficiency.
A sensitive query classifier provides privacy-preserving classification for queries on proprietary datasets, where the client wishes to classify a query point against a database held by a server, without either party exposing their respective data. The Secure k-ish Nearest Neighbors ("k-ish NN") classifier (Shaul et al., 2018) achieves this using homomorphic encryption (HE) and algorithmic relaxations that maintain accuracy while enabling highly scalable, parallel, and communication-efficient deployment.
1. Problem Formulation and Security Constraints
Consider a server holding a database $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ with binary class labels $\mathit{class}(x_i) \in \{0, 1\}$, and a client holding a query point $q \in \mathbb{R}^d$. The conventional kNN classifier assigns the majority label among the $k$ nearest points in $X$: $\mathrm{class}_{k\textsf{NN}}(q) = \mathrm{maj} \bigl\{ \mathit{class}(x_i) \ \big| \ \mathrm{dist}(q, x_i) \ \text{is among the } k \text{ smallest} \bigr\}$. The sensitive query scenario mandates that (a) the client learns only the classification result and no other information about $X$, and (b) the server gains no information about $q$ nor about any intermediate decrypted values. These properties are enforced via an additively or leveled fully homomorphic encryption scheme, providing IND-CPA security and the necessary operations on encrypted data.
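For reference, the plaintext semantics that the secure protocol must emulate fit in a few lines. A minimal NumPy sketch (names are illustrative, not from the paper):

import numpy as np

def knn_classify(q, X, labels, k):
    """Plaintext kNN reference: majority label among the k points nearest to q."""
    dists = np.linalg.norm(X - q, axis=1)      # dist(q, x_i) for every database point
    nearest = np.argsort(dists)[:k]            # indices of the k smallest distances
    return int(labels[nearest].sum() * 2 > k)  # majority vote over binary labels

Everything this does, the sorting included, must instead run on ciphertexts, which is what motivates the relaxations below.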
2. k-ish Nearest Neighbors Relaxation
The core methodological innovation is the relaxation of exact $k$-nearest neighbors to a probabilistic "k-ish" selection. Instead of always returning the majority over the strict $k$ nearest, the classifier computes a random $k'$ such that
$(1-\epsilon)k \le k' \le (1+\epsilon)k$ with high probability,
for tunably small $\epsilon$. The empirical distribution of the distances $d_i = \mathrm{dist}(q, x_i)$ governs the statistical properties underlying the choice of threshold. If this distribution is approximately Gaussian, set
$T = \hat{\mu} + \Phi^{-1}(k/n)\,\hat{\sigma},$
where $\hat{\mu}, \hat{\sigma}$ are the empirical mean and standard deviation of the distances, and $\Phi^{-1}$ is the inverse CDF of the standard normal distribution. Then the expected number of points with $d_i < T$ is $k$. The selected count $k' = |\{ i : d_i < T \}|$ is approximately binomial with mean $k$, and its deviation probability is bounded by a concentration term plus $\Delta$, the statistical distance between the empirical distance distribution and the Gaussian model.
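A plaintext sketch of the threshold rule, using the standard library's NormalDist for $\Phi^{-1}$ on synthetic distances (in the protocol itself, $\hat{\mu}$ and $\hat{\sigma}$ are estimated homomorphically):

import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n, k = 569, 15
dists = rng.normal(50.0, 10.0, size=n)        # stand-in Gaussian-ish distances
mu, sigma = dists.mean(), dists.std()
T = mu + NormalDist().inv_cdf(k / n) * sigma  # T = mu + Phi^{-1}(k/n) * sigma
k_ish = int((dists < T).sum())                # random k' with expectation k
print(T, k_ish)

Re-running with different seeds shows $k'$ fluctuating around $k$, which is precisely the k-ish guarantee.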
3. Double-Blinded Coin-Toss Primitive
Efficient estimation of moments (mean and variance of distances) under HE is enabled by a "double-blinded coin-toss" primitive. Given ciphertext $\Enc(p)$ and plaintext modulus $m$, a coin is tossed that equals 1 with probability $p/(m+1)$, never revealing either the probability or the coin outcome. The pseudocode is:
// Client: holds the keys; Server: holds pk, P = Enc(p), and plaintext modulus m
draw r ∈ {0,…,m} uniformly at random   // server-side randomness
C ← isSmallerHE(r, P)                  // homomorphically computes Enc([r < p])
return C                               // encrypts 1 with probability p/(m+1); the outcome stays encrypted
Here, isSmallerHE is a polynomial of degree $O(m)$ returning a homomorphically encrypted bit. To estimate the mean $\hat{\mu}$ of the distances, toss one coin per distance $d_i$ with probability proportional to $d_i$, sum the encrypted results, and renormalize. Similarly, for the second moment $\widehat{\mu_2}$, use probabilities proportional to $d_i^2$. The variance estimate is
$\widehat{\sigma}^2 = \widehat{\mu_2} - \hat{\mu}^2,$
computed entirely in encrypted space via HE addition and multiplication.
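The following plaintext model illustrates the estimator's logic; the comparisons would run under HE, and the two-coin realization of the squared probability is one possible way to obtain $d_i^2$-proportional coins:

import numpy as np

rng = np.random.default_rng(1)
m = 256                                   # plaintext modulus / quantization range
d = rng.integers(0, m, size=569)          # quantized distances d_i in {0,…,m-1}

# One coin per distance: [r_i < d_i] is 1 with probability exactly d_i/m here.
r = rng.integers(0, m, size=d.size)
mu_hat = (r < d).mean() * m               # renormalized estimate of the mean distance

# Second moment: AND of two independent coins gives probability (d_i/m)^2.
r1, r2 = rng.integers(0, m, size=(2, d.size))
mu2_hat = ((r1 < d) & (r2 < d)).mean() * m**2
var_hat = mu2_hat - mu_hat**2             # sigma^2 = mu_2 - mu^2, HE-friendly ops only
print(mu_hat, var_hat)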
4. Homomorphic Encryption Circuit Architecture
The classification is realized as follows:
- Input preparation: Client supplies $\{ \Enc(q_j) \}_{j=1}^d$; server uses its database $\{x_i\}_{i=1}^n$ in the clear.
- Distance calculation: $\Enc(d_i) = \| \Enc(q) - x_i \|$ via HE polynomials for the chosen metric ($\ell_1$, squared $\ell_2$).
- Moment estimation: Parallel double-blinded coin-tosses yield $\Enc(\hat{\mu}), \Enc(\widehat{\mu_2})$.
- Threshold derivation: Compute $\Enc(T) = \Enc(\hat{\mu}) + \Phi^{-1}(k/n) \cdot \Enc(\hat{\sigma})$.
- Majority vote: For each $i$, compute $isSmallerHE(d_i, T)$, and use these encrypted bits to tally the two class labels (see the plaintext sketch after this list):
$\Enc(c_1) = \sum_i isSmallerHE(d_i, T) \cdot \mathit{class}(x_i),\ \ \Enc(c_0) = \sum_i isSmallerHE(d_i, T) \cdot (1 - \mathit{class}(x_i))$
The overall encrypted majority is $\Enc(\mathrm{class}_q) = isSmallerHE(c_0, c_1)$.
- Output: Server forwards encrypted classifier output.
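Stripped of encryption, the circuit reduces to comparisons, multiplications by server-side labels, and additions. A plaintext mock-up, assuming the threshold T has already been derived as in Section 2:

import numpy as np

def kish_classify(q, X, labels, T):
    """Plaintext model of the HE circuit: threshold comparison plus label tallies."""
    d = np.linalg.norm(X - q, axis=1)
    b = (d < T).astype(int)            # the isSmallerHE(d_i, T) bits
    c1 = (b * labels).sum()            # tally of selected class-1 points
    c0 = (b * (1 - labels)).sum()      # tally of selected class-0 points
    return int(c0 < c1)                # isSmallerHE(c0, c1)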
All modules operate in parallel across $i = 1, \dots, n$, resulting in circuit depth independent of database size $n$: depth for each operation is $O(\log m)$, dominated by evaluating the degree-$O(m)$ comparison polynomial, enabling scalable circuit composition. The plaintext modulus $m$ is chosen large enough to avoid wrap-around on quantized data but small enough to keep the comparison-polynomial degree manageable. Optimizations include distance quantization (8–12 bits), slot-packing for batched operations, and precomputed polynomial coefficients for comparison.
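To illustrate the precomputed comparison coefficients: over $\mathbb{Z}_p$ with $p$ prime, any predicate is a polynomial of degree at most $p - 1$, so $[x < t]$ for a public comparand $t$ (as in the coin-toss module) can be Lagrange-interpolated once and then evaluated homomorphically. A generic sketch, not necessarily the paper's exact construction, with $p$ kept tiny for speed:

p, t = 17, 5                              # toy prime plaintext modulus, public threshold

def step_poly(t, p):
    """Lagrange-interpolate f(x) = [x < t] over Z_p; coefficients low degree first."""
    coeffs = [0] * p
    for a in range(t):                    # only nodes with f(a) = 1 contribute
        num, denom = [1], 1               # basis ell_a = prod_{b != a} (x - b)/(a - b)
        for b in range(p):
            if b == a:
                continue
            num = ([(-b * num[0]) % p]    # multiply num by (x - b) mod p
                   + [(num[i - 1] - b * num[i]) % p for i in range(1, len(num))]
                   + [num[-1]])
            denom = denom * (a - b) % p
        inv = pow(denom, p - 2, p)        # modular inverse via Fermat's little theorem
        coeffs = [(c + inv * u) % p for c, u in zip(coeffs, num)]
    return coeffs

f = step_poly(t, p)
assert all(sum(c * pow(x, i, p) for i, c in enumerate(f)) % p == (1 if x < t else 0)
           for x in range(p))

Evaluating a fixed degree-$(p-1)$ polynomial on a ciphertext needs only logarithmic multiplicative depth, which is the source of the depth figure quoted above.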
5. Security and Correctness Guarantees
The protocol operates under a semi-honest (honest-but-curious) adversarial model. Homomorphic encryption ensures that the client’s query remains hidden and the server’s database is protected, revealing only the final class label. The protocol involves one message from client to server containing encrypted query coordinates, and a return message with the encrypted classification result.
Security is formalized by simulation arguments: the server's observations (public key, encrypted query and output, its own database) are simulatable from random ciphertexts under HE IND-CPA security; the client's view is limited to its input and the encrypted label. Moment estimation leverages Chernoff bounds of the form
$\Pr\bigl[\,|\hat{\mu} - \mu| > \epsilon \mu\,\bigr] \le 2 e^{-\Omega(\epsilon^2 n)},$
which concentrate the random coin-toss outcomes. Combined with the statistical-distance term $\Delta$, the probability that the selected $k'$ strays far from $k$ is exponentially suppressed.
6. Performance Evaluation and Practical Considerations
On the Wisconsin Breast Cancer Dataset (569 points; binary labels), plaintext kNN yields $F_1 \approx 0.99$; Secure k-ish NN achieves $F_1 \approx 0.98$. The classifier thus incurs approximately one percentage point of accuracy loss, compensated by a substantial reduction in computation time: Secure k-ish NN executes in under three hours on 16 cores with HElib/BGV, whereas naive secure kNN (HE sorting) would require weeks.
Communication is minimized: the client sends $d$ ciphertexts (one per query coordinate); the server responds with one or two ciphertexts. The communication cost therefore scales with the dimension $d$ and is independent of the database size $n$. The circuit comprises $O(n)$ parallel comparison and tally modules, with depth independent of $n$, supporting high parallelism.
Practical implementation tips include using the BGV scheme with a plaintext modulus on the order of a few hundred (up to roughly 500), leveraging slot-packing, precomputing coefficients for the comparison and coin-toss modules, and quantizing distances before encryption; a sketch of the quantization step follows.
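The quantization step might look as follows (the bit width and the assumption of a public upper bound d_max on distances are illustrative):

import numpy as np

def quantize(dists, d_max, bits=8):
    """Map distances in [0, d_max] to integers in [0, 2**bits) before encryption."""
    m = 2 ** bits
    q = np.floor(dists / d_max * (m - 1))
    return np.clip(q, 0, m - 1).astype(np.int64)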
7. Conceptual Significance and Implications
The k-ish NN classifier demonstrates that relaxing the strict nearest-neighbor count to an approximate probabilistic variant fundamentally transforms the scalability of secure classification under homomorphic encryption, by replacing expensive sorting with parallelized coin-toss and comparison modules. The result is a one-round protocol supporting efficient, privacy-preserving analytics at only a minimal loss of accuracy. This suggests broader scope for algorithmic relaxations in the development of practical cryptographic machine-learning tools in sensitive-query contexts (Shaul et al., 2018).