Semi-Private Evaluation Set

Updated 20 May 2026
  • Semi-private evaluation sets are defined as data regimes where a subset of sensitive information remains unprotected while the majority undergoes privacy-preserving transformations.
  • They leverage protocols such as PSI-CA and MinHash to enable secure set similarity estimation, offering trade-offs between computational efficiency and controlled information leakage.
  • Applications include fairness in machine learning with noisy sensitive attributes and privacy-preserving multi-party computations, ensuring robust analytics under minimal trust assumptions.

A semi-private evaluation set refers to an evaluation protocol or data regime in which key sensitive information is partially observable, with some fraction available in unprotected or “clean” form and the rest subjected to privacy-preserving mechanisms such as noise injection or cryptographic transformations. This construct is increasingly relevant in both privacy-preserving machine learning and cryptographic protocols for secure data analysis, as it strikes a balance between utility and information leakage, supporting advanced methodologies for similarity assessment, fairness, and collaborative analytics under minimal trust assumptions (Blundo et al., 2011, Chen et al., 2022).

1. Definitions and Formal Setting

In the context of privacy-preserving computation and fair machine learning, a semi-private evaluation set typically consists of a data split where a small portion contains clean (unprotected) values for sensitive attributes or membership, while the majority is privatized, either by cryptographic transformation or by local differential privacy (LDP). For example, for a binary sensitive attribute $A \in \{0,1\}$ used in fair classification, the training set $D$ is partitioned as follows:

  • $D_{clean} = \{(x_i, a_i, y_i): i \in I_{clean}\}$, where $a_i$ is observed without noise.
  • $D_{priv} = \{(x_j, \tilde{a}_j, y_j): j \in I_{priv}\}$, where $\tilde{a}_j$ is an LDP-noisy or cryptographically protected version of $a_j$.

In secure set similarity evaluation, parties may each hold private sets and only exchange encodings, sketches, or outputs that leak strictly delimited information regarding the sets’ intersection or similarity (Blundo et al., 2011).

2. Cryptographic Protocols for Semi-Private Set Similarity

The semi-private evaluation set paradigm is operationalized in set similarity estimation with privacy constraints, as exemplified by the EsPRESSo suite of protocols. The objective is to allow two entities, holding private sets $A$ and $B$ respectively, to compute their Jaccard similarity $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$ while minimizing information leakage.

2.1 Exact (Fully-Secure) Protocol

A classical construction uses Private Set Intersection Cardinality (PSI-CA) under the semi-honest model:

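  • Both parties input their sets; PSI-CA outputs the intersection cardinality $|A \cap B|$ to one party.
  • Both learn the union size $|A \cup B| = |A| + |B| - |A \cap B|$, and then compute $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$.
  • Security is proven by simulation against an ideal PSI-CA functionality, and privacy guarantees hold under the assumptions of the underlying PSI-CA protocol (Blundo et al., 2011).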
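
The arithmetic of the exact protocol can be sketched as follows. This is only an illustration: `psi_ca_plaintext` is a hypothetical, insecure stand-in that computes the intersection size in the clear, whereas a real PSI-CA protocol would produce the same number without either party seeing the other's set. The convention of returning 0 for two empty sets is also an assumption.

```python
def psi_ca_plaintext(set_a, set_b):
    """Insecure stand-in for PSI-CA: returns only |A ∩ B|.

    A real protocol would compute this value cryptographically; this
    helper exists purely to make the arithmetic below runnable.
    """
    return len(set(set_a) & set(set_b))


def exact_jaccard(size_a, size_b, intersection_size):
    """Jaccard index from the two set sizes and the intersection size."""
    union_size = size_a + size_b - intersection_size
    if union_size == 0:
        return 0.0  # assumed convention for two empty sets
    return intersection_size / union_size


set_a = {"apple", "banana", "cherry", "date"}
set_b = {"banana", "cherry", "elderberry"}

intersection_size = psi_ca_plaintext(set_a, set_b)
jaccard = exact_jaccard(len(set_a), len(set_b), intersection_size)
```

Only the two set sizes and the intersection cardinality enter `exact_jaccard`, which mirrors the fact that the protocol's output depends on nothing else.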

2.2 Approximate Protocol via MinHash

To trade exactness for efficiency and reduced leakage, the protocol can use MinHash sketches:

  • Each party computes a $k$-dimensional MinHash sketch, each entry being the minimal hashed value of the set under one of $k$ independent hash functions.
  • The parties run PSI-CA on the sketches, revealing only the number of sketch collisions $c$.
  • The Jaccard estimator is $\hat{J} = c/k$, with theoretical error guarantees:
    • $\mathbb{E}[\hat{J}] = J(A,B)$;
    • $\mathrm{Var}[\hat{J}] = \frac{J(A,B)\,(1 - J(A,B))}{k}$;
    • To bound $|\hat{J} - J(A,B)| \le \epsilon$ with probability at least $1 - \delta$, set $k = O(\epsilon^{-2} \log(1/\delta))$.
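
A minimal plaintext sketch of the MinHash estimator is given below. It assumes that the hash functions are simulated by seeding SHA-256 with an index (an illustrative choice, not necessarily the one used in EsPRESSo), and it counts collisions with an ordinary set intersection where the actual protocol would run PSI-CA. Each sketch entry is tagged with its position so that only matches at the same index count. Both sets must be non-empty.

```python
import hashlib


def _seeded_hash(seed, item):
    """One member of an (assumed) family of hash functions, indexed by seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_sketch(items, k):
    """k-dimensional sketch: the minimum hash of the set under each function."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(k)]


def tagged(sketch):
    """Pair each entry with its index so that matches are position-wise."""
    return {(i, value) for i, value in enumerate(sketch)}


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions on which the two sketches collide."""
    collisions = len(tagged(sketch_a) & tagged(sketch_b))  # PSI-CA in the real protocol
    return collisions / len(sketch_a)


set_a = set(range(0, 600))
set_b = set(range(300, 900))  # true Jaccard index is 300 / 900 = 1/3
k = 128

sketch_a = minhash_sketch(set_a, k)
sketch_b = minhash_sketch(set_b, k)
estimate = estimate_jaccard(sketch_a, sketch_b)
```

With `k = 128` the estimate should typically land within a few hundredths of 1/3; larger `k` tightens it at proportionally higher cost.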

The “semi-private” nature follows from leakage being strictly limited to the number of sketch collisions, revealing no other structure about the underlying sets (Blundo et al., 2011).

3. Semi-Private Machine Learning with Noisy Sensitive Attributes

In fair machine learning, semi-private evaluation sets address the reality that most instances have privatized (noisy) sensitive attributes, while a minority have true values. The FairSP framework is designed for this setting:

  • The majority of sensitive-attribute values $a$ are privatized via LDP randomized response:

$$\tilde{a} = \begin{cases} a & \text{with probability } \pi, \\ 1 - a & \text{with probability } 1 - \pi, \end{cases}$$

with

$$\pi = \frac{e^{\epsilon}}{1 + e^{\epsilon}},$$

satisfying $\epsilon$-LDP (Chen et al., 2022).

  • FairSP implements a two-stage architecture:

    1. Estimation of a corruption matrix using clean data and a preliminary classifier, followed by training a corrector that produces a de-noised estimate of the privatized attribute $\tilde{a}$.

    2. Adversarial debiasing combining both clean and corrected attributes through a minimax game between the predictor and an adversary.
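
The data regime described above can be illustrated with a short sketch. It assumes the standard binary randomized-response mechanism with keep probability $e^{\epsilon}/(1+e^{\epsilon})$; the function names, the 20% clean fraction, and the random split are illustrative assumptions rather than FairSP's actual code. The `corruption_matrix` helper returns the analytic flip matrix implied by randomized response, which serves only as a reference point: FairSP estimates this matrix from data, and that estimation step is not reproduced here.

```python
import math
import random


def keep_probability(epsilon):
    """Probability that randomized response reports the true bit."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(a, epsilon, rng):
    """Report the binary attribute a truthfully, or flipped otherwise."""
    return a if rng.random() < keep_probability(epsilon) else 1 - a


def make_semi_private(attributes, clean_fraction, epsilon, seed=0):
    """Return (released_value, is_clean) pairs for every record.

    A random clean_fraction of records keeps the true attribute; the
    remainder is privatized with randomized response.
    """
    rng = random.Random(seed)
    n = len(attributes)
    clean_idx = set(rng.sample(range(n), int(round(clean_fraction * n))))
    released = []
    for i, a in enumerate(attributes):
        if i in clean_idx:
            released.append((a, True))
        else:
            released.append((randomized_response(a, epsilon, rng), False))
    return released


def corruption_matrix(epsilon):
    """Analytic matrix with entry [a][a_tilde] = P(a_tilde | a)."""
    pi = keep_probability(epsilon)
    return [[pi, 1.0 - pi], [1.0 - pi, pi]]


true_attributes = [i % 2 for i in range(1000)]
released = make_semi_private(true_attributes, clean_fraction=0.2, epsilon=1.0)
```

The ratio `pi / (1 - pi)` equals $e^{\epsilon}$, which is exactly the likelihood-ratio bound that the $\epsilon$-LDP guarantee requires for a binary attribute.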

4. Security, Leakage, and Trade-offs

Semi-private evaluation protocols are characterized by formal guarantees on information leakage:

  • In cryptographic protocols, the only outputs are set cardinalities or similarity metrics, with the MinHash variant revealing only the number of sketch collisions.
  • In machine learning, the true value of the sensitive attribute remains unknown on most data points, and only de-noised or adversarially protected versions are available to loss functions or predictors.

A key trade-off is between accuracy (or exactness) and privacy. For MinHash-based semi-private evaluation of Jaccard similarity, accuracy is determined by the sketch size $k$, with the error decreasing as $k$ increases; communication and computation costs scale linearly with $k$, compared to $O(|A| + |B|)$ for exact PSI-CA (Blundo et al., 2011). In semi-private classification, the availability of clean labels and the strength of LDP noise jointly determine fairness and prediction accuracy (Chen et al., 2022).

Approach                    Computation                       Communication
Exact-Jaccard (PSI-CA)      $O(|A| + |B|)$ exponentiations    $O(|A| + |B|)$ elements
Approx-Jaccard (MinHash)    $O(k)$ exponentiations            $O(k)$ elements
Garbled Circuits            …                                 … ciphertexts
OT-based PSI                … OTs                             … elements
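
The accuracy side of this trade-off can be made concrete with a small calculation. The sketch below assumes that each of the $k$ sketch positions collides independently with probability equal to the true Jaccard index, so the estimator is a binomial proportion; the function name is illustrative.

```python
import math


def minhash_std_error(jaccard, k):
    """Standard deviation of the MinHash estimate for a given sketch size k."""
    return math.sqrt(jaccard * (1.0 - jaccard) / k)


# The worst case is a true Jaccard index of 0.5; quadrupling k halves the error.
for k in (64, 256, 1024):
    print(k, minhash_std_error(0.5, k))
```

Because the error shrinks only as $1/\sqrt{k}$ while cost grows linearly in $k$, each additional digit of precision costs roughly a hundredfold more exponentiations and communication.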

5. Applications and Empirical Results

5.1 Privacy-Preserving Set Similarity

Semi-private set similarity protocols, exemplified by EsPRESSo, are employed in document or multimedia similarity detection (e.g., plagiarism detection using 3-grams), biometric authentication (e.g., iris code comparisons), and genetic marker testing. For large sets, approximate MinHash-based protocols achieve sub-second execution while leaking only similarity, not set elements (Blundo et al., 2011).

5.2 Fairness Under Semi-Private Sensitive Attributes

In FairSP, experiments on datasets such as ADULT (gender), COMPAS (race), and MEPS (race), using a 20% clean, 80% privatized split, show:

  • Accuracy and F₁ scores match or slightly exceed “Clean+Private” attribute baselines.
  • The reported fairness-gap metrics are reduced by up to 50% relative to the Clean+Private baseline, for example on ADULT.
  • This effect persists even at minimal clean-data ratios (as little as 0.02%).
  • Competing methods that discard the privatized attributes or treat all attributes as noisy are outperformed by FairSP on both fairness and predictive accuracy (Chen et al., 2022).

6. Practical Considerations and Parameterization

Precision in protocol configuration is central to practical deployments:

  • MinHash-based protocols require a sketch size $k = O(\epsilon^{-2} \log(1/\delta))$ for target error $\epsilon$ and failure rate $\delta$.
  • In FairSP, large datasets with strong privacy (a small LDP privacy budget) and only a few clean labels necessitate accurate corruption matrix estimation and careful adversarial training. Even with strong LDP noise, robust fairness and utility are attainable.
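
As a rough guide to the first point, the sketch size can be derived from a target error and failure rate. The sketch below uses Hoeffding's inequality, which gives the sufficient condition $k \ge \ln(2/\delta) / (2\epsilon^2)$ when the $k$ sketch positions collide independently. This is one standard way to instantiate the bound and is an assumption here; the constants used in the cited work may differ.

```python
import math


def required_sketch_size(epsilon, delta):
    """Smallest k satisfying the Hoeffding condition 2 * exp(-2 k eps^2) <= delta.

    epsilon is the target absolute error of the Jaccard estimate and
    delta is the allowed probability of exceeding it.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


k_loose = required_sketch_size(epsilon=0.1, delta=0.05)
k_tight = required_sketch_size(epsilon=0.05, delta=0.01)
```

Halving the target error roughly quadruples the required sketch size, whereas tightening the failure rate only adds a logarithmic factor.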

A plausible implication is that semi-private evaluation sets enable privacy-preserving analytics without prohibitive costs in utility or fairness, provided protocol parameters are carefully selected with respect to the specific application.

7. Connections and Extensions

The semi-private evaluation set paradigm generalizes to a wide range of privacy-preserving analytic tasks, including size-hiding PSI-CA and adversarial learning with hybrid label noise. It supports workflows where total non-disclosure is infeasible, but controlled, quantifiable leakage can be tolerated. These methodologies are foundational in privacy-preserving data mining, privacy-aware fairness interventions, and secure multi-party computation, linking classical cryptography (PSI, MinHash, garbled circuits) and modern machine learning under formal privacy and fairness constraints (Blundo et al., 2011, Chen et al., 2022).
