- The paper introduces a novel OT-based FPSI protocol that enables privacy-preserving deduplication by leveraging fuzzy matching in error-prone humanitarian datasets.
- It utilizes a minhash-based LSH embedding to transform biographical data into Hamming space, achieving 99.43% recall with a false positive rate below 0.1%.
- Experimental results demonstrate that xDup is 84× faster than SMC-based methods, ensuring efficient and scalable deduplication for large-scale, resource-constrained scenarios.
xDup: Privacy-Preserving Deduplication for Humanitarian Organizations using Fuzzy PSI
The challenge addressed in "xDup: Privacy-Preserving Deduplication for Humanitarian Organizations using Fuzzy PSI" (2604.08019) is enabling cross-organizational deduplication of biographical recipient records in the humanitarian domain, while guaranteeing stringent privacy properties under adversarial and resource-limited settings. Humanitarian organizations increasingly rely on deduplication across independent field teams to ensure equitable distribution of scarce resources. Unlike corporate or governmental settings, these organizations lack the infrastructure to use strong global identifiers or biometrics, operating primarily with manually collected, error-prone, and incomplete biographical data.
The authors derive functional requirements (high recall in duplicate identification, support for quasi-identifier-based fuzzy matching, and operation without unique IDs) as well as deployment requirements: operation under unreliable networking, a low local computational footprint, batch (offline) and interactive (online) modes, and scaling to databases of 10^5 records with a weekly ingest of thousands. Privacy requirements are particularly strict: the system must not leak information about non-duplicates, and must minimize false positives (target FPR <0.1%), since each potential duplicate leads to further inter-organizational disclosure during manual adjudication. The threat analysis assumes primarily honest-but-curious headquarters, potentially malicious or corrupt field teams (for deduplication protocol steps), and a system model in which strong out-of-band validity checking is enforced for registration data.
xDup System Design: Embedding, Outsourcing, and Protocol Selection
The system's architecture is tailored towards this regime. Registration data is embedded into fixed-length bit strings in Hamming space using a novel minhash-based locality sensitive hashing (LSH) construction, with domain separation and parameterization to control error rates. This embedding is shown empirically to preserve plaintext matching fidelity (Figure 1):
Figure 1: Schematic illustration of xDup's deduplication protocol leveraging transformation into Hamming space, followed by efficient fuzzy set intersection.
Figure 2: High-level deduplication process in humanitarian organizations, illustrating multi-stage registration, private deduplication, and manual adjudication pipelines.
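To make the embedding step concrete, the following is a minimal sketch of a minhash-style map from a biographical record to a fixed-length bit string. It is not the paper's construction: the character-bigram tokenization, the field-name prefix used for domain separation, the one-bit-per-minhash reduction, and the names `shingles`, `embed`, and `hamming` are all assumptions made for illustration.

```python
import hashlib


def shingles(field: str, value: str, q: int = 2) -> set[str]:
    """Character q-grams of a normalized value, prefixed with the field name.

    The prefix is an assumed form of domain separation: the same q-gram in
    two different fields yields two different tokens.
    """
    v = f"_{value.strip().lower()}_"
    return {f"{field}:{v[i:i + q]}" for i in range(len(v) - q + 1)}


def _hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash; each seed plays the role of one minhash permutation."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def embed(record: dict[str, str], l: int = 511) -> list[int]:
    """Map a non-empty record to l bits: one bit per minhash (its lowest bit)."""
    tokens = set().union(*(shingles(f, v) for f, v in record.items()))
    return [min(_hash(seed, t) for t in tokens) & 1 for seed in range(l)]


def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical records: a and b differ by one typo, c is unrelated.
a = embed({"first": "Amina", "last": "Yusuf", "yob": "1990"})
b = embed({"first": "Amina", "last": "Yusef", "yob": "1990"})
c = embed({"first": "Pedro", "last": "Gonzalez", "yob": "1975"})
print(hamming(a, b), hamming(a, c))
```

In this sketch, two records whose token sets have Jaccard similarity J disagree on each bit with probability (1 - J)/2, so near-duplicates land close together in Hamming space while unrelated records sit near l/2.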
Field teams generate shares of their local, embedded registrations and transmit these to two independent compute servers operated by different organizational headquarters. The core deduplication proceeds via a secret-shared fuzzy private set intersection (FPSI) protocol: given embedded queries and database records, the protocol privately identifies all pairs within a Hamming distance threshold, optimizing for high recall and low false positives as empirically validated (see below).
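The share-generation step can be pictured with a two-party XOR sharing of the embedded bit string. This is a sketch under the assumption that shares are XOR shares; the summary does not specify the sharing scheme, and `share_bits` and `reconstruct` are hypothetical helper names.

```python
import secrets


def share_bits(bits: list[int]) -> tuple[list[int], list[int]]:
    """Split a bit string into two XOR shares, one per compute server.

    Each share on its own is uniformly random, so a single non-colluding
    server learns nothing about the embedded record.
    """
    share0 = [secrets.randbits(1) for _ in bits]
    share1 = [b ^ m for b, m in zip(bits, share0)]
    return share0, share1


def reconstruct(share0: list[int], share1: list[int]) -> list[int]:
    """Recombine the two shares; only possible if both servers collude."""
    return [a ^ b for a, b in zip(share0, share1)]


embedded = [1, 0, 1, 1, 0, 0, 1, 0]  # toy 8-bit embedding
s0, s1 = share_bits(embedded)
print(reconstruct(s0, s1) == embedded)
```

The non-collusion assumption on the two headquarters corresponds directly to the fact that `reconstruct` needs both shares.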
The protocol stack is built on an outsourced computation model, avoiding reliance on expensive, general secure multi-party computation (SMC) or homomorphic encryption (HE). Instead, the FPSI protocol operates over secret-shared inputs and outputs, requiring only non-collusion of the compute nodes—an assumption justified by the independent nature of international NGO headquarters.
otFPSI: Efficient Fuzzy Private Set Intersection via Oblivious Transfer
At the technical core of xDup is otFPSI, a concrete FPSI protocol for Hamming space inspired by SHADE [Bringer 2013]. otFPSI supports threshold comparisons at scale using only oblivious transfer (OT), and is free of restrictive input assumptions that limit prior works. The protocol securely computes the Hamming distance between two bitstrings, using a novel application and batching of 1-out-of-N OT to obtain secret shares of the distance, followed by secure thresholding.
This construction yields several advantages:
- Assumption-free accuracy: All returned matches are exact with no reliance on distributional input properties or separation assumptions.
- Fine-grained batching: OT communication is efficiently batched, supporting quasi-linear complexity in the record length.
- Deployable in a secret-shared context: Protocol design natively supports both local and outsourced secret-shared FPSI.
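The idea of obtaining secret shares of a Hamming distance through oblivious transfer can be sketched in the style of SHADE, using one 1-out-of-2 transfer per bit. This is a simplification: the summary describes batched 1-out-of-N OT, the OT itself is replaced here by an insecure stand-in (`ideal_ot`) that only models its input/output behaviour, and all names are hypothetical.

```python
import secrets


def ideal_ot(messages: tuple[int, int], choice: int) -> int:
    """Stand-in for oblivious transfer: returns messages[choice].

    A real OT would hide `choice` from the sender and the unchosen message
    from the receiver; this function provides no such protection.
    """
    return messages[choice]


def hamming_shares(x: list[int], y: list[int], modulus: int) -> tuple[int, int]:
    """Return additive shares (sender, receiver) of HD(x, y) modulo `modulus`.

    The sender holds x, the receiver holds y; `modulus` must exceed len(x).
    """
    mask_sum = 0
    receiver_share = 0
    for xi, yi in zip(x, y):
        r = secrets.randbelow(modulus)  # fresh mask per bit
        m0 = (r + xi) % modulus         # receiver bit 0: xi XOR 0 = xi
        m1 = (r + 1 - xi) % modulus     # receiver bit 1: xi XOR 1 = 1 - xi
        receiver_share = (receiver_share + ideal_ot((m0, m1), yi)) % modulus
        mask_sum = (mask_sum + r) % modulus
    sender_share = (-mask_sum) % modulus
    return sender_share, receiver_share


x = [1, 0, 1, 1, 0, 0, 1, 0]
y = [1, 1, 0, 1, 0, 1, 1, 0]
s, r = hamming_shares(x, y, modulus=16)
print((s + r) % 16)  # recombined only to check the sketch
```

Neither share alone reveals the distance. In the protocol described above, the comparison against the threshold τ is carried out securely on the shares rather than by recombining them as the last line does.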
Careful experimental analysis demonstrates otFPSI outperforms all prior approaches under the humanitarian sector's operational parameters—particularly for large dimension l (typical l≈500) and high threshold τ (e.g., τ=l/4). The system is robust to high-dimensionality induced by realistic biographical field concatenations, in contrast to Euclidean-space FPSI protocols, which are computationally infeasible beyond moderate dimensions.
Empirical Evaluation
The work reports a systematic evaluation on a synthetic dataset designed to mimic typical humanitarian registration data and noise, substantiated by domain consultation. For l = 511 and τ = 132 (roughly l/4), the embedding achieves a false negative rate below 0.57% at a false positive rate of 0.098% (Figure 3), i.e., recall of about 99.43% with fewer than 0.1% spurious matches.

Figure 3: Tradeoff between false positive and false negative rates for the embedding, evaluated on realistic-scale deduplication (131,072 records).
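For reference, the reported rates relate to a distance threshold as in the sketch below, which computes recall and false positive rate from labeled record pairs. The function name `rates`, the toy inputs, and the choice of a non-strict comparison (distance ≤ τ counts as a match) are assumptions for illustration, not details taken from the paper.

```python
def rates(distances: list[int], is_duplicate: list[bool], tau: int) -> tuple[float, float]:
    """Return (recall, false_positive_rate) for a Hamming threshold tau.

    A pair is flagged as a match when its distance is at most tau
    (assumed convention). Recall equals 1 minus the false negative rate.
    """
    tp = fp = fn = tn = 0
    for d, dup in zip(distances, is_duplicate):
        flagged = d <= tau
        if dup and flagged:
            tp += 1
        elif dup:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Toy, made-up pair distances; not data from the evaluation.
toy_distances = [10, 140, 200, 250, 120]
toy_labels = [True, True, False, False, False]
recall, fpr = rates(toy_distances, toy_labels, tau=132)
print(recall, fpr)
```

Sweeping `tau` over a range and plotting the two rates against each other gives a tradeoff curve of the kind shown in Figure 3.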
This is competitive with plaintext record linkage baselines such as EpiLink and Jaccard, and substantially better than LSH-based approximate PSI [Adir 2022] at the same FPR, which achieves only ~86% recall even at lenient parameterizations (Figure 4).
Figure 4: ROC curve illustrating the inferior recall of Private Approximate Jaccard LSH compared to xDup's embedding for low FPR regimes.
In protocol benchmarks, xDup realizes deduplication on batches of 2,048 queries over a 131,072-record database in ~3 hours. This is 84× faster than state-of-the-art SMC-based approaches (MainSEL [Stammler 2020]) and outperforms all publicly benchmarked FPSI protocols (FLPSI, DA-PSI, Approx-PSI, Fmap-FPSI, PE-FPSI) for both runtime and communication at relevant parameter points (Figure 5 and Figure 6).
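As a back-of-the-envelope reading of these figures, the batch size and database size imply the number of query/record pairs covered per run. The sketch below only does that arithmetic; it treats "~3 hours" as exactly three hours and assumes every query is compared against every database record, so the resulting throughput is a rough estimate rather than a reported result.

```python
QUERIES = 2_048               # batch size stated in the text
DATABASE = 131_072            # database size stated in the text
RUNTIME_SECONDS = 3 * 60 * 60 # "~3 hours", taken as exact for the estimate

pairs = QUERIES * DATABASE            # all query/record pairs (assumed all-pairs)
pairs_per_second = pairs / RUNTIME_SECONDS

print(f"{pairs:,} pairs, about {pairs_per_second:,.0f} pairs per second")
```

Under these assumptions a batch covers 2^28 (about 268 million) pairs, or on the order of 25,000 private threshold comparisons per second.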

Figure 5: Comparative run times of otFPSI (our protocol), SilentOT, DA-PSI, and Approx-PSI under realistic set sizes and network conditions.
Figure 6: Scaling characteristics of otFPSI with protocol dimension (l), highlighting run times across settings and batched query sizes.
In the secret-shared extension, the extra overhead is modest—runtime increases are typically within a factor of 5 on fast networks, while the communication overhead is minimized through correlated OT and further batching (otFPSI-ssb variant).
Contrasts with Prior Art
- Bloom filter-based privacy-preserving RL: Existing biographical deduplication over Bloom filters is vulnerable to privacy leakage via linkage and membership inference; xDup's embedding and FPSI protocol are resilient to these attacks by design.
- Blocking/differential privacy approaches: These methods typically optimize for scalability at the cost of false negatives or privacy leaks; xDup's approach returns all true matches within the fuzziness threshold and limits privacy leakage.
- Euclidean-space FPSI: Protocols for Euclidean embeddings are infeasible at realistic post-embedding record dimensions, and require strong separation assumptions that do not hold for short, error-prone biographical fields.
- State-of-the-art FPSI: All recent practical FPSI protocols either (A) rely on restrictive assumptions about distance gaps between records, which are not satisfied in high-noise deduplication, (B) only support low-dimension/thresholds, or (C) require large communication or computational resources, rendering them unsuitable for deployment in low-resource humanitarian settings.
Theoretical and Practical Implications
From a theoretical standpoint, xDup extends the applicability of oblivious transfer to large-scale, batching-based private set intersection in Hamming space. The secret-sharing and outsourcing approach balances security against collusion with resource-constrained operation. The protocol advances FPSI by demonstrating how careful composition and fine-tuned batching of classic OT primitives yield practical, scalable solutions for deduplication on noisy biographical data.
Practically, xDup constitutes a ready-to-deploy system for cross-institutional deduplication in crisis settings, directly enabling coordinated, privacy-safe aid distribution. It minimizes interactive involvement of field teams, outsources heavy protocol steps to headquarters, and is resilient to accidental errors in registration. The absence of unique identifiers and the explicit avoidance of biometric processing mean that xDup is compatible with evolving legal and ethical standards in humanitarian response.
Limitations and Directions for Future Work
xDup presupposes non-collusion of compute nodes (typically different headquarters), and relies on strong out-of-band mechanisms for fraud/registration verification—a necessity in biographical linkage outside adversarial identity settings. While the embedding is highly accurate, it is, by construction, a black-box step, and alternate or learned embeddings may yield higher recall or require less communication if tailored to application specifics.
Potential avenues for further work include:
- Learning task-specific embeddings: Exploring deep or metric-learning based embeddings tailored to expected registration errors.
- Adaptive privacy-utility parameterization: Dynamically tuning thresholds or embedding sizes based on observed FNR/FPR or organizational requirements.
- Robustness analysis for malicious input attacks: Extending the threat model to cover active attacks on the registration stage through adversarial synthesis.
- Communication and bandwidth optimization: Further investigation of hybrid batching techniques and low-overhead correlated OTe variants, which would be valuable in settings with extremely limited bandwidth.
Conclusion
xDup combines practical system construction, cryptographically rigorous protocol analysis, and real-world performance evaluation, tailored to a critically underserved problem class. By leveraging efficient OT-based Hamming FPSI with secret sharing, xDup achieves high-accuracy, privacy-preserving deduplication at scale while minimizing operational burden on resource-limited humanitarian actors. The open-source release and comprehensive benchmarking further position this work as a reference point for future FPSI and real-world, privacy-safe deduplication protocol development.