Privacy-Preserving Nearest Neighbor Search
- Privacy-preserving nearest neighbor search combines cryptographic protocols with indexing techniques to compute similarity without revealing sensitive data.
- It employs techniques like homomorphic encryption, secure multiparty computation, and differential privacy to protect both data and query confidentiality, ensuring controlled information leakage.
- Recent implementations leverage hardware-assisted TEEs and oblivious data structures to optimize throughput and scalability while maintaining stringent privacy guarantees.
Privacy-preserving nearest neighbor search (PP-NNS) encompasses algorithmic frameworks and cryptographic protocols that allow similarity search over a dataset while preventing unauthorized disclosure of private information about the data, queries, or the search process. Motivated by the sensitivity of both user data (e.g., medical or biometric records, location histories, embeddings) and query intent, PP-NNS aims to ensure that service providers, adversarial third parties, or untrusted infrastructure cannot recover plaintext records, queries, or access patterns beyond the agreed-upon leakage profile. Recent developments span provably secure subroutines, efficient protocol composition, approximate/sparse encoding, as well as differential privacy and secure hardware primitives.
1. Threat Models and Fundamental Privacy Goals
PP-NNS protocols are formalized with respect to adversarial models specifying the control and capabilities of the participating parties:
- Semi-honest (honest-but-curious): Parties follow specified protocols but attempt to learn as much as possible from protocol transcripts. This is the dominant model in current PP-NNS systems (Mishra et al., 2024, Saeki et al., 20 Apr 2026, Hashem et al., 2011, Riazi et al., 2016).
- Malicious or active adversaries: More general, allowing deviations, equivocation, or coordinated attacks, but practical protocols for this setting remain rare for high-dimensional NNS due to prohibitive cryptographic overhead.
- Non-colluding co-servers: Often deployed for secure function evaluation and key separation (Mishra et al., 2024, Li et al., 2015).
The principal privacy goals include:
- Data confidentiality: Server(s) and external observers cannot invert encrypted records to recover plaintext.
- Query privacy: Server(s) and other queriers cannot infer the query vector aside from what is revealed by the final answer.
- Access-pattern hiding: Even the choice of which records are the nearest neighbors is concealed from the server whenever possible (Mishra et al., 2024).
- Leakage minimization: The only information revealed is a well-defined leakage function, e.g., output labels or approximate distance orderings.
2. Cryptographic Protocols and Building Blocks
2.1 Homomorphic Encryption and Secure Multiparty Computation
Additively homomorphic schemes, notably Paillier encryption, are widely used to protect both stored data and queries (Mishra et al., 2024, Samanthula et al., 2014). Key properties:
- Homomorphic addition and scalar multiplication: Enables secure computation of Euclidean distances under encryption without decrypting data.
- Two-party computation (e.g., SM, SSED, SBD, SMIN, SBOR): Secure protocols for multiplication, bit decomposition, and minimum selection are composed to extract the k-nearest neighbors in encrypted space while hiding all intermediate values and access patterns. These compositions are provably secure under IND-CPA in the semi-honest, non-colluding model (Mishra et al., 2024, Samanthula et al., 2014).
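As a minimal sketch of how homomorphic addition and scalar multiplication yield a squared Euclidean distance, the following toy Paillier implementation uses tiny hard-coded primes and is not secure. It assumes a simplified setting in which the server holds a plaintext record `p` and only the query is encrypted; the two-server protocols cited above also keep the records encrypted and need a secure multiplication subprotocol, which is not shown. All function names here are illustrative.

```python
import math
import random

# Toy Paillier parameters: tiny primes for illustration only (NOT secure).
P, Q = 10007, 10009
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1)
MU = pow(LAM, -1, N)  # valid because the generator is fixed to g = N + 1


def encrypt(m: int) -> int:
    """Enc(m) = (1 + m*N) * r^N mod N^2 for a random r coprime to N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (1 + (m % N) * N) % N2 * pow(r, N, N2) % N2


def decrypt(c: int) -> int:
    """Dec(c) = L(c^lambda mod N^2) * mu mod N, with L(x) = (x - 1) // N."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def add(c1: int, c2: int) -> int:
    """Homomorphic addition: Enc(a) * Enc(b) = Enc(a + b)."""
    return c1 * c2 % N2


def scalar_mul(c: int, k: int) -> int:
    """Homomorphic scalar multiplication: Enc(a)^k = Enc(k * a)."""
    return pow(c, k % N, N2)


def encrypted_sq_distance(enc_q, enc_q_norm, p):
    """Return Enc(||q - p||^2) using ||q||^2 - 2<q, p> + ||p||^2.

    enc_q holds Enc(q_i); enc_q_norm is Enc(sum of q_i^2), supplied by the
    querier; p is the server's plaintext record (an assumption of this sketch).
    """
    acc = add(enc_q_norm, encrypt(sum(x * x for x in p)))
    for c, x in zip(enc_q, p):
        acc = add(acc, scalar_mul(c, -2 * x))
    return acc
```

The server never decrypts anything here: it combines ciphertexts and returns one ciphertext, which only the key holder can open. Plaintexts live modulo `N`, so coordinates must be small integers (or fixed-point encodings) for the result to be meaningful.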
2.2 Product Quantization with Hybrid FHE and Secure Hardware
Product quantization (PQ), integrated with fully homomorphic encryption (FHE) and trusted execution environments (TEE), enables efficient, privacy-preserving approximate NNS at scale (Saeki et al., 20 Apr 2026):
- PQ splits vectors into subspaces, enabling distance computations over small codebooks.
- Homomorphic packing improves throughput, and data remain encrypted whenever they are outside the enclave.
- In-TEE distance summation and nearest neighbor selection occur after decrypting only distances—the data, codebooks, and queries stay FHE-encrypted.
- Leakage is strictly limited to distances, PQ indices, and final neighbor IDs inside the TEE.
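The PQ distance computation described above can be sketched in plaintext as follows. This shows only the arithmetic (sub-vector split, per-subspace lookup tables, and summation); in the hybrid scheme the table entries would be computed under FHE and summed inside the TEE, which this sketch does not attempt. The codebooks are passed in as fixed lists, and all names are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]  # K centroids for one subspace


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(v: Vector, m: int) -> List[Vector]:
    """Split v into m contiguous sub-vectors of equal length."""
    step = len(v) // m
    return [v[i * step:(i + 1) * step] for i in range(m)]


def pq_encode(v: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Replace each sub-vector by the index of its nearest centroid."""
    subs = split(v, len(codebooks))
    return [
        min(range(len(cb)), key=lambda j: sq_dist(sub, cb[j]))
        for sub, cb in zip(subs, codebooks)
    ]


def distance_tables(query: Vector, codebooks: Sequence[Codebook]) -> List[List[float]]:
    """Per subspace, the distance from the query sub-vector to every centroid."""
    subs = split(query, len(codebooks))
    return [[sq_dist(sub, c) for c in cb] for sub, cb in zip(subs, codebooks)]


def adc_distance(tables: List[List[float]], code: Sequence[int]) -> float:
    """Approximate squared distance: one table lookup per subspace, then a sum."""
    return sum(t[j] for t, j in zip(tables, code))
```

Because each table has only as many entries as the codebook has centroids, the expensive part of the computation is independent of the database size; each database vector then costs one lookup per subspace.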
2.3 Differential Privacy and Sketches
Differential privacy (DP) for nearest neighbor search is used both in database-release settings and in adaptive, interactive settings:
- Random projection (Johnson–Lindenstrauss transform) with calibrated Gaussian noise (Kenthapadi et al., 2012): Ensures (ε, δ)-DP for released embeddings. Pairwise distances can be statistically recovered with concentrated error, and nearest neighbor retrieval remains effective if the noise scale is appropriately chosen.
- DP filter mechanisms in in-context learning (Koskela et al., 6 Nov 2025): Track and bound per-example privacy loss during retrieval in LLM pipelines, enabling privacy budget control across adaptive queries.
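The projection-plus-noise idea from the first bullet can be sketched as below. The noise scale `sigma` is left as a free parameter: calibrating it to a target (ε, δ) follows the cited analysis and is not reproduced here. The projection entries are drawn from N(0, 1/k), an assumption of this sketch, so that squared distances are preserved in expectation, and the estimator subtracts the known noise contribution 2kσ².

```python
import math
import random


def make_projection(d: int, k: int, rng: random.Random):
    """k x d Gaussian matrix with entries N(0, 1/k)."""
    s = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, s) for _ in range(d)] for _ in range(k)]


def noisy_sketch(x, proj, sigma: float, rng: random.Random):
    """Project x to k dimensions and add independent N(0, sigma^2) noise."""
    return [
        sum(p * xi for p, xi in zip(row, x)) + rng.gauss(0.0, sigma)
        for row in proj
    ]


def estimate_sq_distance(s1, s2, sigma: float) -> float:
    """Unbiased estimate of ||x1 - x2||^2 from two noisy sketches.

    The difference of two sketches carries noise of variance 2*sigma^2 per
    coordinate, so its expected squared norm is inflated by 2*k*sigma^2.
    """
    k = len(s1)
    raw = sum((a - b) ** 2 for a, b in zip(s1, s2))
    return raw - 2 * k * sigma ** 2
```

Only the sketches are released; the estimator needs `sigma` but not the original vectors. Larger `sigma` gives stronger privacy and a noisier distance estimate.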
2.4 Oblivious Data Structures and Data-Oblivious Algorithms
Oblivious algorithms ensure the control flow and memory-access patterns of NNS do not depend on secret data:
- Oblivious geometric algorithms (e.g., for all-nearest-neighbors): Employ quadtrees, well-separated pair decompositions, and oblivious sorting, interleaving only low-level black-box cryptographic calls for data-sensitive comparisons (Eppstein et al., 2010).
- Oblivious RAM, secret sharing, and garbled circuits: Provide the foundation for sublinear-time privacy-preserving NNS, supporting advanced protocols such as SANNS (Chen et al., 2019) and threshold-protected HNSW-based search (Guo, 23 Jul 2025).
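To illustrate what "memory-access patterns do not depend on secret data" means, here is a minimal sketch of an oblivious sort based on odd-even transposition, a simple O(n²) sorting network chosen for brevity rather than the more efficient constructions used in the cited work. The sequence of positions touched depends only on the input length. In a real protocol the compare-exchange step would be a secure comparison; plain Python `min` and `max` stand in for it here and are not constant-time.

```python
def oblivious_sort(values):
    """Sort with a fixed schedule of compare-exchange operations.

    Returns the sorted list and the trace of index pairs that were touched.
    The trace is a function of len(values) only, never of the values.
    """
    a = list(values)
    n = len(a)
    trace = []
    for rnd in range(n):
        # Even rounds compare (0,1), (2,3), ...; odd rounds compare (1,2), (3,4), ...
        for i in range(rnd % 2, n - 1, 2):
            trace.append((i, i + 1))
            # Compare-exchange: both positions are always read and written.
            lo, hi = min(a[i], a[i + 1]), max(a[i], a[i + 1])
            a[i], a[i + 1] = lo, hi
    return a, trace
```

Running it on two different inputs of the same length produces identical traces, which is exactly the property an observer of memory accesses would see.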
3. Privacy-Preserving Indexing and Embedding Approaches
3.1 Distance-Preserving and Stochastic Encodings
- Random Projection + Quantization (Biswas et al., 2020): A secret random orthonormal transform followed by quantization maps data into a domain where Euclidean distances are approximately preserved, but original coordinates are hidden unless the secret basis is known. When used with high-capacity ANN indices (e.g., HNSW or KD-tree), this approach provides a tunable privacy-utility trade-off; the server sees only encoded data, never the originals.
- Sparse coding with ambiguation (Razeghi et al., 2021): Input vectors are mapped to sparse sign codes; random noise is added on zero-coordinates to conceal support. Authorized users who know the true support can invert the encoding for “honest” queries, while unauthorized parties face a combinatorial search, ensuring high-distortion reconstruction.
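The random projection plus quantization encoding can be sketched as follows. This is an illustration under stated assumptions, not the cited construction: the secret orthonormal basis is generated by Gram-Schmidt on Gaussian vectors, and quantization is uniform rounding with a hypothetical step size `step`. Since an orthonormal transform preserves Euclidean distance exactly, the only distortion comes from rounding, which shifts each encoded distance by at most `step * sqrt(d)`.

```python
import math
import random


def random_orthonormal(d: int, rng: random.Random):
    """Secret d x d orthonormal basis via Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis


def encode(x, basis, step: float):
    """Rotate x with the secret basis, then quantize to integer codes."""
    return [round(sum(b * xi for b, xi in zip(row, x)) / step) for row in basis]


def encoded_distance(c1, c2, step: float) -> float:
    """Euclidean distance computed by the server on the integer codes."""
    return step * math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
```

A smaller `step` gives better distance fidelity and, under this sketch's assumptions, a finer-grained view of the rotated data; a larger `step` trades accuracy for coarser codes.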
3.2 Differentially Private Sketches and Hashing
- Secure LSH transformation (Riazi et al., 2016): LSH family hash outputs are “flattened” below a threshold to eliminate high-collision leakage for non-neighbors; secure circuit or two-server protocols hide hash seeds. Sub-linear query time is achieved, with utility for true neighbors preserved.
- DP sketch via JL + noise (Kenthapadi et al., 2012): Guarantees that even after data publication, only pairwise distances can be estimated with bounded variance; individual data cannot be reconstructed.
4. Protocol Workflows for PP-NNS
Privacy-preserving NNS protocols generally fit the following abstract workflow, with variants across the literature:
- Data preparation: Data owner encrypts or encodes the dataset using the chosen cryptographic or distance-preserving scheme, outsourcing ciphertexts or encodings to the server (Mishra et al., 2024, Saeki et al., 20 Apr 2026, Hashem et al., 2011).
- Query encryption/encoding: The user transforms their query into the compatible encrypted or encoded format.
- Secure distance computation: Homomorphic encryption, secret-sharing, or index structures (e.g., PQ, DCPE) are used to compare distances between the query and all (or indexed subsets of) database vectors. Advanced protocols use filter-and-refine approaches combining fast approximate indices with secure refinement on candidate records (Liu et al., 14 Aug 2025).
- Neighbor selection: Secure minimum-finding subprotocols—e.g., SMINₙ under Paillier or approximate top-k selection circuits/garbled circuits—identify the k nearest records without leaking which records are selected (Mishra et al., 2024, Chen et al., 2019).
- Result retrieval and post-processing: The class label or index of the nearest records is decrypted or sent to the user; for classification and voting, secure aggregation and secure maximum subprotocols may be involved (Mishra et al., 2024, Samanthula et al., 2014).
These steps are carefully orchestrated to preserve semantic security, access-pattern hiding, and resistance to adaptive attacks.
5. Security Analysis and Impossibility Results
PP-NNS protocols are typically analyzed via composition theorems, simulation-based proofs, and explicit adversarial game reductions (Mishra et al., 2024, Guo, 23 Jul 2025, Eppstein et al., 2010):
- Data and query confidentiality: Under standard cryptographic assumptions (e.g., Paillier IND-CPA), each party’s view can be simulated given only permitted outputs. No additional data is leaked.
- Access-pattern privacy: Secure minimum-selection and oblivious retrieval protocols ensure the server cannot determine which records are queried.
- Limitations under multiple-data-owner or colluding adversaries: Exact k-NN provides unsatisfactory protection when data owners or queriers may collude, because adaptive attacks can deduce distances or reconstruct queries by observing decision boundaries across repeated queries. In these settings, approximate or kernel-based classifiers (e.g., Gaussian KDE) are provably resistant to distance-learning attacks, since each record changes the output only through a known, fixed kernel contribution (Li et al., 2015).
- Leakage analysis via reduction games: Recent protocols formalize all leakage, including index-structure exposure, in terms of explicit leakage functions, and prove no additional information is revealed even when aggregate indices or traversal pointers are shared (Guo, 23 Jul 2025).
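For reference, the functional form of a Gaussian kernel density classifier, the kind of kernel-based substitute for exact k-NN mentioned above, can be sketched in plaintext as follows. This shows only what is computed, not how the cited protocol evaluates it privately; the `bandwidth` parameter and the function name are assumptions of this sketch.

```python
import math
from collections import defaultdict


def kde_classify(query, records, bandwidth: float):
    """Classify by summing Gaussian kernel weights per label.

    records is an iterable of (vector, label) pairs. Every record contributes
    a smooth weight exp(-||q - x||^2 / (2 h^2)); there is no hard cut-off at
    the k-th neighbor as in exact k-NN.
    """
    scores = defaultdict(float)
    for vec, label in records:
        d2 = sum((q - v) ** 2 for q, v in zip(query, vec))
        scores[label] += math.exp(-d2 / (2.0 * bandwidth ** 2))
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Unlike exact k-NN, the output varies smoothly with every record instead of flipping when a record crosses the k-th neighbor boundary.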
6. Performance and Practical Considerations
PP-NNS systems are evaluated on metrics including computational efficiency, scalability, communication overhead, and empirical privacy/utility trade-off:
| Approach | Query Latency | Communication | Security | Scalability/Notes |
|---|---|---|---|---|
| Paillier-based kNN (Mishra et al., 2024, Samanthula et al., 2014) | Seconds (LAN, n=10k, m=16) | <100 MB/query | Full IND-CPA, pattern hiding | Linear in n∙m, parallelizable |
| PQ+FHE+TEE hybrid (Saeki et al., 20 Apr 2026) | >50 QPS (n~1M) | 1.5–6 MiB/query | Leakage: codebook/indices only | PQ/packing enables million-scale |
| Distance-preserving encoding + ANN (Biswas et al., 2020) | ms/query (p=16) | O(k) | Non-invertible, quantization | Precision/recall near raw data |
| Secure LSH (Riazi et al., 2016) | ~1 s (n=10⁹, k=8) | O(nρ+ℓ) bits | ε-secure (info-theoretic) | Sublinear query time |
| Differential privacy + JL (Kenthapadi et al., 2012) | O(nk) | O(nk) (sketch pub/release) | (ε,δ)-DP | Effective for moderate noise; subsequent fast ANN possible |
| Approximate Top-k/SANNS (Chen et al., 2019) | 30 s (n=10⁷) | 5.5 GB/query | Semi-honest, simulation-based | 10⁷-scale, multi-threaded |
Overall, deploying PP-NNS at large scale requires trade-offs between cryptographic assurance, practical throughput, and leakage minimization. Techniques such as product quantization, SIMD-packing, and hybrid TEE deployment are pivotal for practical throughput (Saeki et al., 20 Apr 2026).
Approximate and filtered search strategies, such as filter-and-refine with a privacy-preserving index, reduce the search space for secure distance computation, thereby enhancing scalability (Liu et al., 14 Aug 2025). Advanced randomized algorithms for approximate top-k in garbled circuits further alleviate the circuit bottleneck typical of exact secure computation (Chen et al., 2019).
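The filter-and-refine pattern can be sketched as follows. The index construction of the cited work is not reproduced; this sketch assumes a hypothetical `coarse_index` of cheap, server-visible encodings and a `refine_distance` callback standing in for the expensive secure distance protocol. The point is only that the costly step runs on `n_candidates` records rather than on the whole database.

```python
import heapq


def sq_dist(a, b) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filter_and_refine(coarse_query, coarse_index, refine_distance, n_candidates: int, k: int):
    """Return the ids of the k best records after a two-stage search.

    Filter: rank every record by a cheap distance on its coarse encoding and
    keep the n_candidates best ids.
    Refine: call refine_distance(record_id) only on those candidates.
    """
    candidates = heapq.nsmallest(
        n_candidates,
        range(len(coarse_index)),
        key=lambda i: sq_dist(coarse_query, coarse_index[i]),
    )
    return heapq.nsmallest(k, candidates, key=refine_distance)
```

Recall depends on `n_candidates`: if a true neighbor is dropped by the coarse filter, the refinement stage cannot recover it, so this parameter trades secure-computation cost against accuracy.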
7. Applications, Extensions, and Open Challenges
PP-NNS protocols have seen applications in outsourced data mining, privacy-preserving recommendation systems, location-based services with trajectory privacy, federated translation with cross-client datastore sharing, and context retrieval for private in-context learning (Mishra et al., 2024, Lu et al., 2015, Hashem et al., 2011, Du et al., 2023, Koskela et al., 6 Nov 2025).
Key open challenges and future directions include:
- Multi-owner and malicious threat resilience: Existing impossibility results block exact k-NN under collusion; research continues into kernel- and DP-based substitutes resilient to such attacks (Li et al., 2015).
- Dynamic updates: Efficient, privacy-preserving maintenance of the index for insertions/deletions remains a challenge, especially under secure or distributed secret sharing.
- Access-pattern hiding beyond semi-honest: Achieving O(1) leakage (e.g., through ORAM or oblivious indices) remains a central goal for high-sensitivity datasets (Saeki et al., 20 Apr 2026).
- Resource-efficient secure approximate selection: Sublinear protocols leveraging privacy-aware LSH or data-oblivious inductive structures are approaching practical scale (Riazi et al., 2016, Feng et al., 5 Jun 2025).
- Hardware-assisted secure computation: TEE-based acceleration is promising, but side-channel and memory leakage must be carefully bounded under the chosen adversarial model (Saeki et al., 20 Apr 2026).
- Differential privacy/utility trade-offs under adaptivity: Privacy filter mechanisms, per-example DP budgeting, and task-adaptive retrieval strategies are essential for LLM in-context search (Koskela et al., 6 Nov 2025).
Privacy-preserving nearest neighbor search is thus a rapidly advancing and technically rich field, combining cryptography, high-dimensional data analysis, and scalable system design. The current literature demonstrates both rigorous security, predominantly in the semi-honest model, and practical performance at million-scale search, with a spectrum of techniques balancing leakage, accuracy, and computational cost.