Semi-Private Evaluation Set
- Semi-private evaluation sets are defined as data regimes where a subset of sensitive information remains unprotected while the majority undergoes privacy-preserving transformations.
- They leverage protocols such as PSI-CA and MinHash to enable secure set similarity estimation, offering trade-offs between computational efficiency and controlled information leakage.
- Applications include fairness in machine learning with noisy sensitive attributes and privacy-preserving multi-party computations, ensuring robust analytics under minimal trust assumptions.
A semi-private evaluation set refers to an evaluation protocol or data regime in which key sensitive information is partially observable, with some fraction available in unprotected or “clean” form and the rest subjected to privacy-preserving mechanisms such as noise injection or cryptographic transformations. This construct is increasingly relevant in both privacy-preserving machine learning and cryptographic protocols for secure data analysis, as it strikes a balance between utility and information leakage, supporting advanced methodologies for similarity assessment, fairness, and collaborative analytics under minimal trust assumptions (Blundo et al., 2011, Chen et al., 2022).
1. Definitions and Formal Setting
In the context of privacy-preserving computation and fair machine learning, a semi-private evaluation set typically consists of a data split where a small portion contains clean (unprotected) values for sensitive attributes or membership, while the majority is privatized—either by cryptographic transformation or local differential privacy (LDP). For example, for a binary sensitive attribute used in fair classification, the training set is partitioned as follows:
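- A clean subset $D_c = \{(x_i, y_i, a_i)\}$, where the sensitive attribute $a_i$ is observed without noise.
- A privatized subset $D_p = \{(x_j, y_j, \tilde{a}_j)\}$, where $\tilde{a}_j$ is an LDP-noisy or cryptographically protected version of the true attribute $a_j$.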
In secure set similarity evaluation, parties may each hold private sets and only exchange encodings, sketches, or outputs that leak strictly delimited information regarding the sets’ intersection or similarity (Blundo et al., 2011).
2. Cryptographic Protocols for Semi-Private Set Similarity
The semi-private evaluation set paradigm is operationalized in set similarity estimation with privacy constraints, as exemplified by the EsPRESSo suite of protocols. The objective is to allow two entities, each holding private sets $A$ and $B$, to compute their Jaccard similarity $J(A, B) = |A \cap B| / |A \cup B|$ while minimizing information leakage.
2.1 Exact (Fully-Secure) Protocol
A classical construction uses Private Set Intersection Cardinality (PSI-CA) under the semi-honest model:
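- Both parties input their sets; PSI-CA outputs the intersection cardinality $|A \cap B|$ to one party.
- Given the set sizes $|A|$ and $|B|$, the union cardinality follows as $|A \cup B| = |A| + |B| - |A \cap B|$, and the similarity is computed as $J(A, B) = |A \cap B| / |A \cup B|$.
- Security is proven by simulation against an ideal PSI-CA functionality, and the privacy guarantees rest on the security of the underlying PSI-CA protocol (Blundo et al., 2011).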
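The arithmetic behind the exact protocol can be illustrated with a minimal sketch. It is not a secure implementation: `psi_ca_plaintext` is a hypothetical plaintext stand-in for a real PSI-CA protocol, used only to show that the intersection cardinality plus the two set sizes suffice to compute the Jaccard index.

```python
def psi_ca_plaintext(a: set, b: set) -> int:
    """Hypothetical plaintext stand-in for PSI-CA: returns only |A ∩ B|.

    A real PSI-CA protocol computes this value without either party
    seeing the other's elements; this function offers no privacy.
    """
    return len(a & b)


def jaccard_from_cardinalities(size_a: int, size_b: int, intersection: int) -> float:
    """Jaccard index from |A|, |B| and |A ∩ B| via inclusion-exclusion."""
    union = size_a + size_b - intersection
    # Two empty sets have an empty union; report similarity 0.0 by convention.
    return intersection / union if union else 0.0


a = {"apple", "banana", "cherry", "date"}
b = {"banana", "cherry", "elderberry"}

inter = psi_ca_plaintext(a, b)                            # |A ∩ B| = 2
j = jaccard_from_cardinalities(len(a), len(b), inter)     # 2 / (4 + 3 - 2)
print(inter, j)
```

Only three integers enter the final computation, which is why the output of PSI-CA is enough and no set elements need to be exchanged in the clear.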
2.2 Approximate Protocol via MinHash
To trade exactness for efficiency and reduced leakage, the protocol can use MinHash sketches:
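- Each party computes a $k$-dimensional MinHash sketch, e.g., $S_A = (\min_{a \in A} h_1(a), \dots, \min_{a \in A} h_k(a))$ for the party holding $A$, where each entry is the minimal hashed value of the set under one of $k$ independent hash functions $h_1, \dots, h_k$.
- The parties run PSI-CA on the sketches, revealing only the number of sketch collisions $c$.
- The Jaccard estimator is $\hat{J} = c / k$, with theoretical error guarantees:
  - $\mathbb{E}[\hat{J}] = J(A, B)$;
  - $\mathrm{Var}[\hat{J}] = J(A, B)\,(1 - J(A, B)) / k$;
  - To bound $|\hat{J} - J(A, B)| \le \alpha$ with probability at least $1 - \delta$, set $k = O(\alpha^{-2} \log(1/\delta))$.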
The “semi-private” nature follows from leakage being strictly limited to the number of sketch collisions, revealing no other structure about the underlying sets (Blundo et al., 2011).
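The MinHash estimator described above can be sketched in a few lines. This is an illustration under stated assumptions, not the EsPRESSo implementation: the seeded SHA-256 hash family, the function names, and the plaintext position-wise comparison are all choices made here; in the private protocol the collision count would instead be obtained by running PSI-CA over the two sketches.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """One member of an assumed hash family: SHA-256 keyed by an integer seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(items: set, k: int) -> list:
    """k-entry sketch: entry i is the minimum of hash function i over the set.

    Assumes a non-empty set (min() of an empty sequence raises ValueError).
    """
    return [min(_hash(i, x) for x in items) for i in range(k)]


def estimate_jaccard(sketch_a: list, sketch_b: list) -> float:
    """Fraction of positions where the two sketches agree (collisions / k)."""
    k = len(sketch_a)
    collisions = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return collisions / k


a = {f"item{i}" for i in range(0, 600)}
b = {f"item{i}" for i in range(300, 900)}

true_j = len(a & b) / len(a | b)  # 300 shared out of 900 distinct elements
est = estimate_jaccard(minhash_sketch(a, 256), minhash_sketch(b, 256))
print(f"true={true_j:.3f} estimate={est:.3f}")
```

Each sketch position agrees with probability equal to the true Jaccard index, so the estimate is an average of `k` indicator variables and tightens as `k` grows.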
3. Semi-Private Machine Learning with Noisy Sensitive Attributes
In fair machine learning, semi-private evaluation sets address the reality that most instances have privatized (noisy) sensitive attributes, while a minority have true values. The FairSP framework is designed for this setting:
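- The majority of sensitive attribute values $a$ are privatized via LDP randomized response:
  $$\tilde{a} = \begin{cases} a & \text{with probability } p, \\ 1 - a & \text{with probability } 1 - p, \end{cases}$$
  with
  $$p = \frac{e^{\epsilon}}{1 + e^{\epsilon}},$$
  satisfying $\epsilon$-LDP (Chen et al., 2022).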
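- FairSP implements a two-stage architecture:
  1. Attribute correction: a corruption matrix $C$, describing how true attribute values are mapped to privatized ones, is estimated using clean data and a preliminary classifier; a corrector is then trained to minimize a correction loss, producing an estimate $\hat{a}$ that de-noises the privatized attribute $\tilde{a}$.
  2. Adversarial debiasing: the clean and corrected attributes are combined in a minimax game in which the classifier is trained to predict the label while an adversary tries to recover the sensitive attribute.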
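The randomized-response step and the clean/privatized split from this section can be sketched as follows. This is a minimal illustration for a binary attribute under the standard randomized-response mechanism, where the true value is kept with probability $p = e^{\epsilon} / (1 + e^{\epsilon})$. The function names and the 20%/80% split helper are hypothetical. The closed-form corruption matrix shown here is what randomized response implies analytically; FairSP itself is described as estimating the matrix from clean data and a preliminary classifier, which this sketch does not attempt.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(a: int, epsilon: float, rng: random.Random) -> int:
    """Return the true bit with probability p, the flipped bit otherwise."""
    return a if rng.random() < keep_probability(epsilon) else 1 - a


def corruption_matrix(epsilon: float) -> list:
    """C[i][j] = P(reported = j | true = i) for binary randomized response."""
    p = keep_probability(epsilon)
    return [[p, 1.0 - p], [1.0 - p, p]]


def semi_private_split(attrs, clean_fraction: float, epsilon: float, seed: int = 0):
    """Keep a random fraction of attributes clean and privatize the rest.

    Returns two dicts keyed by example index: clean values and noisy values.
    """
    rng = random.Random(seed)
    idx = list(range(len(attrs)))
    rng.shuffle(idx)
    n_clean = int(round(clean_fraction * len(attrs)))
    clean = {i: attrs[i] for i in idx[:n_clean]}
    private = {i: randomized_response(attrs[i], epsilon, rng) for i in idx[n_clean:]}
    return clean, private


attrs = [i % 2 for i in range(10_000)]  # synthetic binary sensitive attribute
clean, private = semi_private_split(attrs, clean_fraction=0.2, epsilon=1.0)

# Empirical flip rate on the privatized part; it should be near 1 / (1 + e^epsilon).
flip_rate = sum(private[i] != attrs[i] for i in private) / len(private)
print(len(clean), len(private), round(flip_rate, 3))
```

The ratio of the keep probability to the flip probability is exactly $e^{\epsilon}$, which is the likelihood-ratio bound that $\epsilon$-LDP requires.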
4. Security, Leakage, and Trade-offs
Semi-private evaluation protocols are characterized by formal guarantees on information leakage:
- In cryptographic protocols, the only outputs are set cardinalities or similarity metrics, with the MinHash variant revealing only the number of sketch collisions.
- In machine learning, the true value of the sensitive attribute remains unknown on most data points, and only de-noised or adversarially protected versions are available to loss functions or predictors.
A key trade-off is between accuracy (or exactness) and privacy. For MinHash-based semi-private evaluation of Jaccard similarity, accuracy is determined by the sketch size $k$, with the error decreasing as $k$ increases; communication and computation costs scale linearly with $k$, compared to linearly with the set sizes, $O(|A| + |B|)$, for exact PSI-CA (Blundo et al., 2011). In semi-private classification, the availability of clean labels and the strength of LDP noise jointly determine fairness and prediction accuracy (Chen et al., 2022).
| Approach | Computation | Communication |
|---|---|---|
| Exact-Jaccard (PSI-CA) | $O(\lvert A \rvert + \lvert B \rvert)$ exps | $O(\lvert A \rvert + \lvert B \rvert)$ elems |
| Approx-Jaccard (MinHash) | $O(k)$ exps | $O(k)$ elems |
| Garbled Circuits | 4 | 5 ciphertexts |
| OT-based PSI | 6 OTs | 7 elems |
5. Applications and Empirical Results
5.1 Privacy-Preserving Set Similarity
Semi-private set similarity protocols, exemplified by EsPRESSo, are employed in document or multimedia similarity detection (e.g., plagiarism detection using 3-grams), biometric authentication (e.g., iris code comparisons), and genetic marker testing. For large sets, approximate MinHash-based protocols with a fixed sketch size $k$ achieve sub-second execution while leaking only similarity, not set elements (Blundo et al., 2011).
5.2 Fairness Under Semi-Private Sensitive Attributes
Experiments with FairSP on datasets such as ADULT (gender), COMPAS (race), and MEPS (race), using a 20% clean / 80% privatized split under a fixed LDP privacy budget $\epsilon$, show:
- Accuracy and F₁ scores match or slightly exceed “Clean+Private” attribute baselines.
- The two reported fairness-gap metrics are reduced by up to 50% relative to the Clean+Private baseline, e.g., on ADULT.
- This effect persists even at minimal clean-data ratios (as little as 0.02%).
- Competing methods that discard the privatized attributes or treat all attributes as noisy are outperformed by FairSP on both fairness and predictive accuracy (Chen et al., 2022).
6. Practical Considerations and Parameterization
Precision in protocol configuration is central to practical deployments:
- MinHash-based protocols require a sketch size $k = O(\alpha^{-2} \log(1/\delta))$ for target error $\alpha$ and failure rate $\delta$.
- In FairSP, large datasets with strong privacy (a small privacy budget $\epsilon$) and only a few clean labels necessitate accurate corruption matrix estimation and proper adversarial training. Even with strong LDP noise, where the flip probability $1 / (1 + e^{\epsilon})$ is large, robust fairness and utility are reported to be attainable.
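For the MinHash parameterization, one concrete way to choose the sketch size is Hoeffding's inequality, sketched below. This is an assumption made for illustration: the source states only the requirement on the sketch size, not a specific constant. The bound applies because the estimator is an average of `k` independent 0/1 indicators, so the probability of deviating from the true similarity by more than the target error is at most `2 * exp(-2 * k * error**2)`. The function name and parameter names are hypothetical.

```python
import math


def sketch_size(error: float, failure_prob: float) -> int:
    """Smallest k such that 2 * exp(-2 * k * error^2) <= failure_prob.

    Derived from Hoeffding's inequality for an average of k independent
    indicators; this is one valid sufficient condition, not a tight bound.
    """
    if not (0.0 < error < 1.0 and 0.0 < failure_prob < 1.0):
        raise ValueError("error and failure_prob must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * error ** 2))


k = sketch_size(error=0.05, failure_prob=0.01)
print(k)  # 1060
```

Halving the target error quadruples the required sketch size, while tightening the failure probability costs only a logarithmic factor, so the error tolerance dominates communication and computation cost.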
A plausible implication is that semi-private evaluation sets enable privacy-preserving analytics without prohibitive costs in utility or fairness, provided protocol parameters are carefully selected with respect to the specific application.
7. Connections and Extensions
The semi-private evaluation set paradigm generalizes to a wide range of privacy-preserving analytic tasks, including size-hiding PSI-CA and adversarial learning with hybrid label noise. It supports workflows where total non-disclosure is infeasible, but controlled, quantifiable leakage can be tolerated. These methodologies are foundational in privacy-preserving data mining, privacy-aware fairness interventions, and secure multi-party computation, linking classical cryptography (PSI, MinHash, garbled circuits) and modern machine learning under formal privacy and fairness constraints (Blundo et al., 2011, Chen et al., 2022).