Conformal Data Contamination Tests

Updated 21 July 2025
  • Conformal data contamination tests are distribution‐free statistical procedures that leverage exchangeability and nonconformity scoring to detect and quantify outliers, mislabeled samples, and distributional deviations.
  • They compute conformal p-values from reference data to provide finite-sample inference and strict type-I error control for identifying contaminated datasets.
  • These tests are crucial in collaborative learning and data marketplaces, offering practical quality guarantees before data aggregation and model training.

Conformal data contamination tests are a class of statistical procedures designed to detect, quantify, and control the presence of contaminated data—such as outliers, mislabeled samples, or distributional deviations—in datasets used for model training, evaluation, trading, or sharing. Unlike traditional parametric approaches, conformal methods offer rigorous, distribution-free guarantees on validity, leveraging the principles of exchangeability and nonconformity scoring to provide finite-sample inference even under arbitrary contamination. These tests have become increasingly important for both classification and outlier detection applications, especially in collaborative learning, federated learning, data marketplaces, and situations where data provenance or quality cannot be fully trusted.

1. Contamination-Aware Testing Frameworks

The conformal data contamination test framework operates in contexts where a data consumer seeks to augment their own “inlier” or clean data (with distribution $P_0$) by acquiring potentially contaminated datasets from external agents. These external datasets are modeled as mixtures: $P_k = (1 - \pi_k) P_0 + \pi_k P_k^{\mathrm{outlier}}$, where $\pi_k$ is the (unknown) contamination fraction for the $k$-th agent. The goal is to determine, without distributional assumptions, whether $\pi_k$ is below a user-specified tolerance $\pi_{\mathrm{th}}$ before aggregating or purchasing the data.
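
As a minimal illustration of this mixture model, the sketch below simulates a hypothetical agent dataset; the Gaussian choices for $P_0$ and the outlier distribution, and the contamination fraction, are placeholder assumptions, not those of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_agent_data(m, pi_k, d=8):
    """Draw m samples from the mixture (1 - pi_k) * P0 + pi_k * P_outlier.

    P0 and the outlier law are illustrative placeholders:
    P0 = N(0, I), outliers = N(3, I) in d dimensions.
    """
    is_outlier = rng.random(m) < pi_k               # Bernoulli(pi_k) contamination mask
    inliers = rng.normal(0.0, 1.0, size=(m, d))     # clean samples from P0
    outliers = rng.normal(3.0, 1.0, size=(m, d))    # contaminated samples
    return np.where(is_outlier[:, None], outliers, inliers), is_outlier

X_k, mask = sample_agent_data(m=200, pi_k=0.15)
print(f"empirical contamination fraction: {mask.mean():.2f}")
```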

Conformal tests approach this by using a set of reference data from P0P_0 (the requester’s own dataset) to define nonconformity or conformity scores. These are then used to compute conformal p-values for each new observation from an agent, which measure how “atypical” the observation is relative to the clean reference. By aggregating the evidence from many such p-values, the framework provides rigorous, type-I error controlled tests for contamination at the dataset (agent) level (Vejling et al., 18 Jul 2025).

2. Testing Procedures and Aggregation of Conformal p-values

The central practical step is the computation of conformal p-values for each candidate sample $Z_{n+i}$ received from an agent: $\hat{p}_i = \frac{1}{n-\ell+1}\left(1 + \sum_{j=\ell+1}^n \mathbbm{1}[\hat{s}_j \leq \hat{s}_{n+i}]\right)$, where $\{\hat{s}_j\}$ are conformity scores of the reference (calibration) data, and $\hat{s}_{n+i}$ is the score of the new sample.
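
A minimal numpy sketch of this step, assuming conformity scores have already been produced by some score function fitted on the first $\ell$ reference points (the scores used in the toy call are random placeholders):

```python
import numpy as np

def conformal_p_values(calib_scores, test_scores):
    """Conformal p-values for agent samples, following the formula above.

    calib_scores : conformity scores s_{l+1}, ..., s_n of the held-out
                   reference (calibration) points, shape (n - l,).
    test_scores  : conformity scores of the agent's m samples, shape (m,).
    Returns p-values taking values in {1/(n-l+1), ..., 1}.
    """
    calib = np.asarray(calib_scores)[None, :]       # shape (1, n - l)
    test = np.asarray(test_scores)[:, None]         # shape (m, 1)
    counts = (calib <= test).sum(axis=1)            # sum of 1[s_j <= s_{n+i}]
    return (1.0 + counts) / (calib.shape[1] + 1.0)

# toy usage with random placeholder scores
rng = np.random.default_rng(1)
p_hat = conformal_p_values(rng.normal(size=500), rng.normal(size=20))
```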

Multiple recent aggregation techniques are supported:

  • Storey estimator: $T^{\rm storey} = \sum_{i=1}^m \mathbbm{1}[\hat{p}_i > \lambda]$, with the corresponding test p-value computed from the statistic’s known distribution under the null.
  • Quantile-based aggregation: The empirical quantile of the p-values is compared to the expected uniform quantile under $P_0$.
  • Fisher’s method or sum statistics: These combine p-value evidence additively or multiplicatively.

All such tests provide non-asymptotic, finite-sample guarantees of type-I error control (the probability of falsely accepting contaminated data is bounded by the desired significance level). The explicit expressions for test p-values are derived using properties of uniform and negative hypergeometric distributions.
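
To make the aggregation step concrete, the following sketch computes a Storey-type count and a quantile-based aggregate from a vector of conformal p-values; the decision thresholds must still be calibrated from the exact finite-sample null distributions described above, so the toy values here are illustrative only.

```python
import numpy as np

def storey_statistic(p_values, lam=0.5):
    """Storey-type aggregate: number of conformal p-values exceeding lambda.

    A large count means few atypical samples, i.e. evidence that the agent's
    contamination fraction is below the tolerance."""
    return int(np.sum(np.asarray(p_values) > lam))

def quantile_statistic(p_values, q=0.25):
    """Quantile-based aggregate: the empirical q-quantile of the p-values,
    to be compared with its expected behaviour under clean (uniform) p-values."""
    return float(np.quantile(p_values, q))

# Toy usage; real decision thresholds come from the exact null distributions
# (uniform / negative hypergeometric arguments) mentioned above.
rng = np.random.default_rng(2)
toy_p = rng.uniform(size=50)                        # stand-in for conformal p-values
print(storey_statistic(toy_p), quantile_statistic(toy_p))
```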

When analyzing data from multiple agents simultaneously, the test outputs can be submitted to the Benjamini–Hochberg (BH) procedure for false discovery rate (FDR) control, as the Storey-based p-values satisfy the required positive regression dependency (PRDS) property (Vejling et al., 18 Jul 2025).
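
A minimal sketch of this multi-agent screening step, assuming each agent has already been summarized by a single valid p-value (the values below are placeholders); this is a standard Benjamini–Hochberg step-up rule, not code from the cited paper.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Boolean mask of hypotheses rejected by the BH step-up rule at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m    # BH thresholds alpha * i / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])           # largest index meeting its threshold
        reject[order[:k + 1]] = True
    return reject

# one valid p-value per agent (placeholders); True = agent admitted for aggregation
agent_p = np.array([0.002, 0.04, 0.30, 0.011, 0.76])
print(benjamini_hochberg(agent_p, alpha=0.1))
```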

3. Theoretical Guarantees and Statistical Properties

The theoretical underpinnings of conformal data contamination tests rest on the distribution-free property of conformal p-values: when a new sample is truly in-distribution, its p-value is uniformly distributed, regardless of the specifics of $P_0$. This invariance ensures that the tests are valid under arbitrary contamination structures and even heterogeneous outlier distributions among agents.
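
This uniformity can be checked with a short Monte Carlo sketch: when the calibration scores and the test score are drawn exchangeably from the same distribution (here an arbitrary Gaussian stand-in for a score function), the conformal p-values are (super)uniform, i.e. $P(\hat{p} \leq u) \leq u$.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cal, n_trials = 200, 5000
p_vals = np.empty(n_trials)
for t in range(n_trials):
    calib = rng.normal(size=n_cal)                  # calibration conformity scores
    test = rng.normal()                             # an in-distribution test score
    p_vals[t] = (1 + np.sum(calib <= test)) / (n_cal + 1)

# Under exchangeability, P(p <= u) <= u for every u in (0, 1).
for u in (0.05, 0.10, 0.25):
    print(f"P(p <= {u:.2f}) ~ {np.mean(p_vals <= u):.3f}")
```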

Direct calculation using combinatorial arguments shows that the aggregated test statistics (such as $T^{\rm storey}$) have an exact, known distribution at the boundary of the null hypothesis, where the agent’s contamination fraction equals the tolerance $\pi_{\mathrm{th}}$. Explicit bounding formulas—for example, using the negative hypergeometric cumulative distribution function—permit finite-sample, sharp control of type-I error for each agent.

When multiple agents are tested, the p-values’ PRDS property ensures that the overall FDR is controlled after BH correction, yielding statistically rigorous agent selection for downstream tasks.

4. Empirical Applications and Performance

Extensive empirical studies demonstrate the robustness and utility of conformal data contamination tests for collaborative/federated learning and data trading/sharing. Experiments involve image datasets such as MNIST and FEMNIST, where agents provide data potentially subject to:

  • Label noise (incorrect class annotation),
  • Feature noise (corrupted input measurements),
  • Distributional shifts (e.g., mixtures of uppercase and lowercase characters).

The framework effectively identifies agents whose data pass the contamination threshold, filtering out high-contamination agents and thereby maintaining the accuracy of the learner’s personalized models. Reported metrics include the true discovery rate (correctly accepting low-contamination agents), false discovery rate, and the power of the test (probability of correctly rejecting contaminated sources). Empirical results closely track oracle (ground truth) benchmarks in controlling the quality of aggregated datasets (Vejling et al., 18 Jul 2025).
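
As an illustration of how such contamination can be injected at a target rate $\pi_k$ in experiments of this kind, a simplified corruption routine is sketched below; it is a stand-in, not the exact corruption pipeline of the cited study.

```python
import numpy as np

rng = np.random.default_rng(4)

def corrupt_agent(X, y, pi_k, n_classes=10, mode="label"):
    """Corrupt roughly a fraction pi_k of an agent's samples.

    mode="label":   reassign labels uniformly at random (label noise).
    mode="feature": add strong Gaussian noise to the inputs (feature noise).
    """
    X, y = X.copy(), y.copy()
    idx = rng.random(len(y)) < pi_k                 # which samples get corrupted
    if mode == "label":
        y[idx] = rng.integers(0, n_classes, size=idx.sum())
    else:
        X[idx] += rng.normal(0.0, 2.0, size=X[idx].shape)
    return X, y

# toy usage with synthetic stand-ins for image features and labels
X = rng.normal(size=(500, 784))
y = rng.integers(0, 10, size=500)
X_noisy, y_noisy = corrupt_agent(X, y, pi_k=0.2, mode="label")
```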

5. Practical Implications for Data Trading and Sharing

From a practical standpoint, conformal data contamination tests enable buyers/data requesters to obtain quality guarantees before acquiring external data. The procedure is entirely data-driven and requires no modeling of the data sources’ distributional form, making it applicable in a broad range of scenarios:

  • Data marketplaces, where buyers wish to purchase only “inliers” from sellers whose data match the required distribution.
  • Federated/collaborative learning, where protocol participants’ data must be pre-screened for relevance and cleanliness before aggregation or model update.
  • Medical or scientific consortia, where strict contamination thresholds are required due to downstream regulatory or safety implications.

By integrating the framework into routine workflows, organizations can ensure that only statistically justifiable, relevant datasets are used for high-stakes training or evaluation, thus reducing negative impacts from irrelevant, corrupted, or adversarial data.

6. Mathematical Formulations

Key mathematical expressions that define the testing procedure include:

  • Conformal p-value for new sample:

$\hat{p}_i = \frac{1}{n-\ell+1}\left(1 + \sum_{j=\ell+1}^n \mathbbm{1}[\hat{s}_j \leq \hat{s}_{n+i}]\right)$

  • Aggregated Storey test statistic:

$T^{\rm storey} = \sum_{i=1}^m \mathbbm{1}[\hat{p}_i > \lambda]$

  • Valid p-value for contamination level $\pi_{\mathrm{th}}$:

$\hat{u}^{\rm storey} = \sum_{k=T^{\rm storey}+1}^{m} B_{\pi_{\mathrm{th}}}(k)\, F_{\mathrm{NHG}}\left(\lfloor \lambda (n+1) \rfloor - 1;\; n+k,\, n,\, k-T^{\rm storey}\right) + \ldots$

These ensure that the test remains exactly or asymptotically valid, regardless of unknowns in the data generating process.
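
The negative hypergeometric factor in the last expression can be evaluated with `scipy.stats.nhypergeom`. The sketch below assumes that the arguments $(n+k, n, k-T^{\rm storey})$ map onto scipy's $(M, n, r)$ parameterization and that $B_{\pi_{\mathrm{th}}}(k)$ is a Binomial$(m, \pi_{\mathrm{th}})$ pmf; both are assumptions of this sketch, and only the displayed partial sum is reproduced (the trailing terms elided in the text are not reconstructed).

```python
import numpy as np
from scipy.stats import binom, nhypergeom

def nhg_factor(T, k, n, lam):
    """F_NHG(floor(lam*(n+1)) - 1; n+k, n, k-T), assuming the arguments map onto
    scipy's nhypergeom(M, n, r) parameterization as (M, n, r) = (n+k, n, k-T)."""
    x = int(np.floor(lam * (n + 1))) - 1
    return nhypergeom.cdf(x, n + k, n, k - T)

def storey_valid_p_partial(T, m, n, lam, pi_th):
    """Partial sum of the displayed series only; B_{pi_th}(k) is taken to be the
    Binomial(m, pi_th) pmf (an assumption), and the terms elided in the text
    ("+ ...") are not reproduced."""
    ks = np.arange(T + 1, m + 1)
    return float(sum(binom.pmf(k, m, pi_th) * nhg_factor(T, k, n, lam) for k in ks))

# toy usage with placeholder values
print(storey_valid_p_partial(T=30, m=50, n=500, lam=0.5, pi_th=0.1))
```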

7. Significance for Conformal Methods and Future Directions

The conformal data contamination testing methodology stands out for its distribution-free rigor, applicability in high-stakes data trading, and its role as a practical tool for quality assurance prior to model training or decision making. Unlike purely post hoc measures (e.g., data valuation), these tests offer interpretable, pre-aggregation quality guarantees. By naturally integrating with multiple hypothesis testing correction procedures, they remain robust in large-scale, multi-agent, or multi-benchmark scenarios.

Ongoing directions include the development of even more powerful conformal contamination tests for more complex contamination regimes (e.g., adversarial insertions, mixed outlier types), finer-grained per-sample elimination, and scalable implementations suitable for streaming, high-dimensional data environments.


Table: Key Steps in Conformal Data Contamination Tests

| Step | Purpose | Example Implementation |
| --- | --- | --- |
| Reference selection | Define clean calibration dataset from $P_0$ | Use local data or trusted subset |
| Conformal p-values | Quantify conformity of each agent’s sample | Compute $\hat{p}_i$ for new data |
| Aggregation | Test global contamination versus threshold | Storey/Fisher/Quantile/Sum test |
| FDR control | Screen multiple agents fairly | Benjamini–Hochberg procedure |
| Acquisition/filter | Admit/reject agents for training or sharing | Select agents with $\pi_k \leq \pi_{\rm th}$ |

Conformal data contamination tests provide a generic, theoretically justified toolkit for distinguishing and aggregating high-quality data with rigorous statistical guarantees, supporting collaborative, secure, and fair data-driven machine learning workflows (Vejling et al., 18 Jul 2025).
