Semi-Private Evaluation Set

Updated 20 May 2026

Semi-private evaluation sets are defined as data regimes where a subset of sensitive information remains unprotected while the majority undergoes privacy-preserving transformations.
They leverage protocols such as PSI-CA and MinHash to enable secure set similarity estimation, offering trade-offs between computational efficiency and controlled information leakage.
Applications include fairness in machine learning with noisy sensitive attributes and privacy-preserving multi-party computations, ensuring robust analytics under minimal trust assumptions.

A semi-private evaluation set refers to an evaluation protocol or data regime in which key sensitive information is partially observable, with some fraction available in unprotected or “clean” form and the rest subjected to privacy-preserving mechanisms such as noise injection or cryptographic transformations. This construct is increasingly relevant in both privacy-preserving machine learning and cryptographic protocols for secure data analysis, as it strikes a balance between utility and information leakage, supporting advanced methodologies for similarity assessment, fairness, and collaborative analytics under minimal trust assumptions (Blundo et al., 2011, Chen et al., 2022).

1. Definitions and Formal Setting

In the context of privacy-preserving computation and fair machine learning, a semi-private evaluation set typically consists of a data split where a small portion contains clean (unprotected) values for sensitive attributes or membership, while the majority is privatized—either by cryptographic transformation or local differential privacy (LDP). For example, for a binary sensitive attribute $A \in \{0,1\}$ used in fair classification, the training set $D$ is partitioned as follows:

$D_{clean} = \{(x_i, a_i, y_i): i \in I_{clean}\}$, where $a_i$ is observed without noise.
$D_{priv} = \{(x_j, \tilde{a}_j, y_j): j \in I_{priv}\}$, where $\tilde{a}_j$ is an LDP-noisy or cryptographically protected version of $a_j$.

In secure set similarity evaluation, parties may each hold private sets and only exchange encodings, sketches, or outputs that leak strictly delimited information regarding the sets’ intersection or similarity (Blundo et al., 2011).

2. Cryptographic Protocols for Semi-Private Set Similarity

The semi-private evaluation set paradigm is operationalized in set similarity estimation with privacy constraints, as exemplified by the EsPRESSo suite of protocols. The objective is to allow two entities, holding private sets $A$ and $B$ respectively, to compute their Jaccard similarity $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$ while minimizing information leakage.

2.1 Exact (Fully-Secure) Protocol

A classical construction uses Private Set Intersection Cardinality (PSI-CA) under the semi-honest model:

Both parties input their sets; PSI-CA outputs the intersection cardinality $|A \cap B|$ to one party.
Both learn $|A \cup B| = |A| + |B| - |A \cap B|$, and then compute $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$.
Security is proven by simulation against an ideal functionality, and privacy guarantees hold as long as the underlying PSI-CA protocol is secure (Blundo et al., 2011).
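The arithmetic of the exact protocol can be sketched in a few lines of Python. This is only an illustration of the inclusion-exclusion step: `plaintext_psi_ca` is a hypothetical stand-in that computes the intersection size in the clear and provides no cryptographic protection, and the convention for two empty sets is an assumption made here, not something the source specifies.

```python
def plaintext_psi_ca(a: set, b: set) -> int:
    """Hypothetical stand-in for PSI-CA: returns only |A ∩ B|.

    A real PSI-CA protocol computes this value cryptographically;
    this function works in the clear and is for illustration only.
    """
    return len(a & b)


def jaccard_from_cardinalities(size_a: int, size_b: int, size_inter: int) -> float:
    """Jaccard index from |A|, |B| and |A ∩ B| via inclusion-exclusion."""
    if size_inter < 0 or size_inter > min(size_a, size_b):
        raise ValueError("intersection size is inconsistent with the set sizes")
    size_union = size_a + size_b - size_inter
    if size_union == 0:
        # Assumed convention for two empty sets (not taken from the source).
        return 0.0
    return size_inter / size_union


# Example with word 3-grams from two short documents.
doc_a = {"the quick brown", "quick brown fox", "brown fox jumps"}
doc_b = {"quick brown fox", "brown fox jumps", "fox jumps over"}
inter = plaintext_psi_ca(doc_a, doc_b)
print(jaccard_from_cardinalities(len(doc_a), len(doc_b), inter))  # 2 / 4 = 0.5
```

Only the three cardinalities enter the final computation, which is why a protocol that reveals $|A \cap B|$ and the set sizes is sufficient for exact Jaccard evaluation.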

2.2 Approximate Protocol via MinHash

To trade exactness for efficiency and reduced leakage, the protocol can use MinHash sketches:

Each party computes a $k$-dimensional MinHash sketch, each entry being the minimal hashed value of the set under one of $k$ independent hash functions.
The parties run PSI-CA on the sketches, revealing only the number of sketch collisions $m$.
The Jaccard estimator is $\hat{J} = m/k$, with theoretical error guarantees:
- $\mathbb{E}[\hat{J}] = J(A,B)$;
- $\mathrm{Var}[\hat{J}] = \frac{J(A,B)\,(1 - J(A,B))}{k}$;
- To bound $|\hat{J} - J(A,B)| \le \epsilon$ with probability at least $1 - \delta$, set $k = O(\epsilon^{-2} \log(1/\delta))$.
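The estimator above can be simulated in the clear. The following is a minimal sketch, assuming salted BLAKE2b digests as a substitute for $k$ independent hash functions and a position-wise comparison in place of the PSI-CA step; the function names are illustrative and are not taken from EsPRESSo.

```python
import hashlib


def minhash_sketch(items, k):
    """Return a k-entry MinHash sketch of a non-empty set of strings."""
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    sketch = []
    for i in range(k):
        # Assumption: salting BLAKE2b with the index i stands in for
        # the i-th of k independent hash functions.
        salt = i.to_bytes(8, "big")
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for x in items
        ))
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate J(A, B) as m / k, where m counts matching sketch positions.

    In the protocol this count would come from PSI-CA run on
    position-tagged sketch entries; here it is computed in the clear.
    """
    if len(sketch_a) != len(sketch_b) or not sketch_a:
        raise ValueError("sketches must be non-empty and of equal length")
    m = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return m / len(sketch_a)


# Two sets with true Jaccard similarity 200 / 400 = 0.5.
set_a = {f"item{i}" for i in range(0, 300)}
set_b = {f"item{i}" for i in range(100, 400)}
k = 256
print(estimate_jaccard(minhash_sketch(set_a, k), minhash_sketch(set_b, k)))
```

The printed estimate should land near 0.5, with a standard deviation of roughly $\sqrt{0.25/256} \approx 0.03$ under the independence assumption.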

The “semi-private” nature follows from leakage being strictly limited to the number of sketch collisions $m$, revealing no other structure about the underlying sets (Blundo et al., 2011).

3. Semi-Private Machine Learning with Noisy Sensitive Attributes

In fair machine learning, semi-private evaluation sets address the reality that most instances have privatized (noisy) sensitive attributes, while a minority have true values. The FairSP framework is designed for this setting:

The majority of sensitive-attribute values $a_j$ are privatized via LDP randomized response:

$\tilde{a}_j = \begin{cases} a_j & \text{with probability } \pi \\ 1 - a_j & \text{with probability } 1 - \pi \end{cases}$

with

$\pi = \frac{e^{\epsilon}}{1 + e^{\epsilon}}$

satisfying $\epsilon$-LDP (Chen et al., 2022).
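As an illustration of this data regime, the sketch below applies binary randomized response to all but a randomly chosen clean fraction of the attributes. It assumes the standard binary mechanism with keep probability $e^{\epsilon}/(1+e^{\epsilon})$; the helper names and the `clean_fraction` parameter are hypothetical and do not come from FairSP.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit: e^eps / (1 + e^eps)."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    # Algebraically equal to e^eps / (1 + e^eps), but avoids overflow.
    return 1.0 / (1.0 + math.exp(-epsilon))


def randomized_response(a: int, epsilon: float, rng: random.Random) -> int:
    """Return a with probability pi, else the flipped bit 1 - a."""
    if a not in (0, 1):
        raise ValueError("the sensitive attribute must be binary")
    return a if rng.random() < keep_probability(epsilon) else 1 - a


def semi_private_split(attrs, clean_fraction, epsilon, seed=0):
    """Return (value, is_clean) pairs: a random clean subset keeps its
    true attribute, every other entry is privatized."""
    rng = random.Random(seed)
    indices = list(range(len(attrs)))
    rng.shuffle(indices)
    clean = set(indices[: round(clean_fraction * len(attrs))])
    return [
        (a, True) if i in clean else (randomized_response(a, epsilon, rng), False)
        for i, a in enumerate(attrs)
    ]


attrs = [i % 2 for i in range(1000)]
released = semi_private_split(attrs, clean_fraction=0.2, epsilon=1.0)
flips = sum(1 for (v, c), a in zip(released, attrs) if not c and v != a)
print(f"clean: {sum(c for _, c in released)}, flipped among private: {flips}")
```

The ratio of keep probability to flip probability equals $e^{\epsilon}$, which is the condition that makes the mechanism $\epsilon$-LDP for a binary attribute.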

FairSP implements a two-stage architecture:
1. Estimation of a corruption matrix using clean data and a preliminary classifier, followed by training a corrector to minimize its training objective, producing a de-noised estimate $\hat{a}_j$ of each privatized attribute $\tilde{a}_j$.
2. Adversarial debiasing that combines both clean and corrected attributes through a minimax game between the classifier and an adversary.

4. Security, Leakage, and Trade-offs

Semi-private evaluation protocols are characterized by formal guarantees on information leakage:

In cryptographic protocols, the only outputs are set cardinalities or similarity metrics, with the MinHash variant revealing only the number of sketch collisions.
In machine learning, the true value of the sensitive attribute remains unknown on most data points, and only de-noised or adversarially protected versions are available to loss functions or predictors.

A key trade-off is between accuracy (or exactness) and privacy. For MinHash-based semi-private evaluation of Jaccard similarity, accuracy is determined by the sketch size $k$, with the error decreasing as $k$ increases; communication and computation costs scale linearly with $k$, compared to $O(|A| + |B|)$ for exact PSI-CA (Blundo et al., 2011). In semi-private classification, the availability of clean labels and the strength of LDP noise jointly determine fairness and prediction accuracy (Chen et al., 2022).

Approach	Computation	Communication
Exact-Jaccard (PSI-CA)	$O(|A| + |B|)$ exponentiations	$O(|A| + |B|)$ elements
Approx-Jaccard (MinHash)	$O(k)$ exponentiations	$O(k)$ elements
Garbled Circuits	…	… ciphertexts
OT-based PSI	… OTs	… elements

5. Applications and Empirical Results

5.1 Privacy-Preserving Set Similarity

Semi-private set similarity protocols, exemplified by EsPRESSo, are employed in document or multimedia similarity detection (e.g., plagiarism detection using 3-grams), biometric authentication (e.g., iris code comparisons), and genetic marker testing. For large-scale sets, approximate MinHash-based protocols with a suitably chosen sketch size $k$ achieve sub-second execution while leaking only similarity, not set elements (Blundo et al., 2011).

5.2 Fairness Under Semi-Private Sensitive Attributes

In FairSP, experiments on datasets such as ADULT (gender), COMPAS (race), and MEPS (race), using a 20% clean, 80% privatized split and a fixed LDP privacy budget $\epsilon$, demonstrate:

Accuracy and F₁ scores match or slightly exceed “Clean+Private” attribute baselines.
Both reported fairness metrics are reduced by up to 50% relative to the Clean+Private baseline, for example on ADULT.
This effect persists even at minimal clean-data ratios (as little as 0.02%).
Competing methods that discard $D_{priv}$ or treat all attributes as noisy are outperformed by FairSP on both fairness and predictive accuracy (Chen et al., 2022).
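To make the notion of a fairness metric concrete, the sketch below computes two widely used group-fairness gaps for a binary sensitive attribute. Treating the reported metrics as the demographic parity gap and the equal opportunity gap is an assumption made for illustration; the function names are hypothetical, and the toy data is not drawn from any of the datasets above.

```python
def _positive_rate(y_pred, mask):
    """Fraction of positive predictions among the selected samples."""
    selected = [p for p, keep in zip(y_pred, mask) if keep]
    if not selected:
        raise ValueError("a group has no samples")
    return sum(selected) / len(selected)


def demographic_parity_gap(y_pred, a):
    """|P(y_hat = 1 | a = 1) - P(y_hat = 1 | a = 0)|."""
    return abs(
        _positive_rate(y_pred, [g == 1 for g in a])
        - _positive_rate(y_pred, [g == 0 for g in a])
    )


def equal_opportunity_gap(y_true, y_pred, a):
    """Absolute difference in true-positive rates between the two groups."""
    return abs(
        _positive_rate(y_pred, [g == 1 and y == 1 for g, y in zip(a, y_true)])
        - _positive_rate(y_pred, [g == 0 and y == 1 for g, y in zip(a, y_true)])
    )


# Toy example: eight samples, four per group.
a      = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(demographic_parity_gap(y_pred, a))         # |2/4 - 1/4| = 0.25
print(equal_opportunity_gap(y_true, y_pred, a))  # |2/3 - 1/2|, about 0.167
```

Both gaps are zero for a perfectly group-fair predictor, so a reduction in either value corresponds to improved fairness. Evaluating them requires true attribute values, which in the semi-private setting are available only on the clean subset.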

6. Practical Considerations and Parameterization

Precision in protocol configuration is central to practical deployments:

MinHash-based protocols require $k = O(\epsilon^{-2} \log(1/\delta))$ for target error $\epsilon$ and failure rate $\delta$.
In FairSP, large datasets with strong privacy (small privacy budget $\epsilon$) and only a few clean labels necessitate accurate corruption matrix estimation and proper adversarial training. Even with strong LDP noise (a high flip probability at small $\epsilon$), robust fairness and utility are attainable.
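Both parameter choices can be turned into small calculators. The sketch below is illustrative only: the explicit constant in the sketch-size formula comes from Hoeffding's inequality, $\Pr[|\hat{J} - J| \ge \epsilon] \le 2\exp(-2k\epsilon^2)$, which is one way to instantiate the asymptotic requirement and assumes independent hash functions; the flip probability assumes binary randomized response. The function names are hypothetical.

```python
import math


def minhash_sketch_size(target_error: float, failure_rate: float) -> int:
    """Smallest k with 2 * exp(-2 * k * eps^2) <= delta (Hoeffding bound)."""
    if not (0 < target_error < 1 and 0 < failure_rate < 1):
        raise ValueError("target_error and failure_rate must lie in (0, 1)")
    return math.ceil(math.log(2.0 / failure_rate) / (2.0 * target_error ** 2))


def flip_probability(privacy_budget: float) -> float:
    """Probability that binary randomized response flips the attribute."""
    if privacy_budget < 0:
        raise ValueError("privacy_budget must be non-negative")
    return 1.0 / (1.0 + math.exp(privacy_budget))


# Sketch size for a 5% error tolerance with 1% failure probability.
print(minhash_sketch_size(0.05, 0.01))  # 1060

# Flip probability falls from 1/2 as the privacy budget grows.
for eps in (0.0, 0.5, 1.0, 2.0):
    print(eps, round(flip_probability(eps), 4))
```

Halving the target error quadruples the required sketch size, while tightening the failure rate costs only a logarithmic factor.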

A plausible implication is that semi-private evaluation sets enable privacy-preserving analytics without prohibitive costs in utility or fairness, provided protocol parameters are carefully selected with respect to the specific application.

7. Connections and Extensions

The semi-private evaluation set paradigm generalizes to a wide range of privacy-preserving analytic tasks, including size-hiding PSI-CA and adversarial learning with hybrid label noise. It supports workflows where total non-disclosure is infeasible, but controlled, quantifiable leakage can be tolerated. These methodologies are foundational in privacy-preserving data mining, privacy-aware fairness interventions, and secure multi-party computation, linking classical cryptography (PSI, MinHash, garbled circuits) and modern machine learning under formal privacy and fairness constraints (Blundo et al., 2011, Chen et al., 2022).


References

Blundo et al. (2011). EsPRESSo: Efficient Privacy-Preserving Evaluation of Sample Set Similarity.

Chen et al. (2022). When Fairness Meets Privacy: Fair Classification with Semi-Private Sensitive Attributes.
