PrettYharmonize: Leakage-Free Data Harmonization
- PrettYharmonize is an algorithmic framework for leakage-free data harmonization that removes site-specific variability without using true test-set labels.
- It employs a pretended-label strategy and meta-learning (stacking) to integrate predictions, preserving biological signal in high-dimensional biomedical data.
- Validation on synthetic and real-world datasets demonstrates its competitive performance in mitigating data leakage while maintaining target-related signals.
PrettYharmonize is an algorithmic framework for data harmonization in multi-site ML pipelines, specifically designed to address data leakage associated with traditional ComBat-based methods. It enables the removal of site-specific variability (effects of site, EoS) from high-dimensional biomedical data (such as MRI and clinical measurements) while rigorously avoiding the use of test-set labels during harmonization. PrettYharmonize achieves this by generating “pretended” labels for test data and deploying a meta-learning (stacking) strategy to integrate predictions, thereby preserving biological signal relevant to the target variable without introducing leakage. Its design is validated on both controlled synthetic datasets (benchmarking leakage and harmonization performance) and a range of real-world clinical and imaging data, demonstrating competitive performance in site–target dependent scenarios without overfitting to site confounds (Nieto et al., 2024).
1. Motivation and Problem Setting
The harmonization of multi-site data is foundational in biomedical ML due to inter-site variation stemming from heterogeneous scanners, acquisition protocols, and laboratory environments. When removing EoS, the central challenge is maintaining predictive information on the biological target without introducing data leakage. Data leakage, in this context, is defined as any process during harmonization whereby knowledge of test-set labels (the ground-truth target variable $y$) is utilized, whether directly or implicitly. Standard ComBat harmonization typically preserves biological covariates by conditioning harmonization on $y$, requiring $y$ even when applied to held-out or prospective data. This leads to over-optimistic generalization and is incompatible with real-world deployment, where labels are unavailable at prediction time.
PrettYharmonize is explicitly designed to answer the following: How can ComBat-style harmonization be integrated into ML pipelines such that:
- (a) all site effects (EoS) are effectively removed,
- (b) the target-related signal is preserved,
- (c) no use of true test labels is made at any harmonization step?
2. Mathematical and Algorithmic Framework
2.1 Generative Model and Notation
The data generative process is modeled as:

$$x_{iv} = \alpha_v + f_v(y_i) + \gamma_{s_i v} + \delta_{s_i v}\,\varepsilon_{iv}$$

where:
- $x_{iv}$: value of feature $v$ for sample $i$
- $y_i$: biological target (classification or regression)
- $s_i$: site of sample $i$
- $\alpha_v$: global intercept
- $f_v(y_i)$: target (biological) effect (e.g., linear or spline)
- $\gamma_{s_i v}$: additive site effect
- $\delta_{s_i v}$: multiplicative site-specific scaling
- $\varepsilon_{iv}$: residual noise
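For concreteness, a minimal numpy sketch of this generative process, assuming a linear target effect $f_v(y) = \beta_v y$ and hypothetical parameter scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_feat, n_sites = 1000, 18, 8

y = rng.integers(0, 2, size=n)            # binary biological target y_i
s = rng.integers(0, n_sites, size=n)      # site assignment s_i

alpha = rng.normal(0.0, 1.0, n_feat)              # global intercepts alpha_v
beta = rng.normal(0.5, 0.1, n_feat)               # linear target effect f_v(y) = beta_v * y
gamma = rng.normal(0.0, 0.8, (n_sites, n_feat))   # additive site effects gamma_sv
delta = rng.uniform(0.5, 1.5, (n_sites, n_feat))  # multiplicative scalings delta_sv
eps = rng.normal(0.0, 1.0, (n, n_feat))           # residual noise

X = alpha + beta * y[:, None] + gamma[s] + delta[s] * eps
```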
2.2 Pretended Label Strategy
Instead of supplying the true label $y$ for test samples during harmonization, PrettYharmonize constructs $K$ “pretended” labels $y^*_1, \ldots, y^*_K$ for each test sample. In classification, $K$ equals the number of classes and each pretended label sets $y^*$ to one of the classes. In regression, the $K$ pretended labels evenly tile the observed target range $[y_{\min}, y_{\max}]$.
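A minimal sketch of the pretended-label construction (the uniform grid for regression is our reading of “evenly tiles”):

```python
import numpy as np

def pretended_labels(y_train, task, K=10):
    """Return the K pretended labels y*_1..y*_K assigned to every test sample."""
    if task == "classification":
        return np.unique(y_train)          # one pretended label per class
    # regression: K values evenly tiling the observed training target range
    return np.linspace(y_train.min(), y_train.max(), K)
```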
2.3 Harmonization Transformation
For each pretended label $y^*_k$, harmonization parameters $(\hat{\alpha}_v, \hat{f}_v, \hat{\gamma}_{sv}, \hat{\delta}_{sv})$ are estimated using only the training set, minimizing the squared residuals of the generative model:

$$\min_{\alpha_v,\, f_v,\, \gamma_{sv}} \sum_i \bigl(x_{iv} - \alpha_v - f_v(y_i) - \gamma_{s_i v}\bigr)^2$$

The resulting harmonized features for sample $i$ under pretended label $y^*_k$ follow the ComBat adjustment:

$$\tilde{x}^{(k)}_{iv} = \frac{x_{iv} - \hat{\alpha}_v - \hat{f}_v(y^*_k) - \hat{\gamma}_{s_i v}}{\hat{\delta}_{s_i v}} + \hat{\alpha}_v + \hat{f}_v(y^*_k)$$

Each sample thus yields $K$ harmonized versions.
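The sketch below illustrates this transformation, assuming a linear target effect fitted by ordinary least squares and moment-based site estimates; the published implementation relies on ComBat's empirical-Bayes fitting, so treat this as an approximation:

```python
import numpy as np

def fit_harmonizer(X, y, s, n_sites):
    """Estimate (alpha_v, beta_v, gamma_sv, delta_sv) on TRAINING data only."""
    A = np.column_stack([np.ones(len(y)), y])        # intercept + linear target effect
    coef, *_ = np.linalg.lstsq(A, X, rcond=None)     # shape (2, n_feat)
    resid = X - A @ coef
    gamma = np.stack([resid[s == j].mean(0) for j in range(n_sites)])  # additive EoS
    delta = np.stack([resid[s == j].std(0) for j in range(n_sites)])   # multiplicative EoS
    return coef[0], coef[1], gamma, np.clip(delta, 1e-6, None)

def harmonize(X, y, s, params):
    """ComBat-style adjustment: strip site effects, re-add the target effect.

    At test time, y holds a pretended label y*_k, never the true label.
    """
    alpha, beta, gamma, delta = params
    target = alpha + beta * np.asarray(y, float)[:, None]
    return (X - target - gamma[s]) / delta[s] + target
```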
2.4 Meta-Learning (Stacking) and Prediction
A stack model is trained using inner-fold cross-validation, ingesting the $K$ prediction scores generated by a base model $M$ applied to each harmonized version of the validation set. At test time, the stack model combines the $K$ scores for each sample to produce the final prediction. Critically, no step provides the true label $y$ to the harmonization model.
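A simplified scikit-learn sketch of this stage, assuming binary classification and one base model per pretended-label version (the paper's exact inner-fold layout may differ). `X_versions_train` is the list of $K$ training matrices harmonized under each $y^*_k$:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_pretty_stack(X_versions_train, y_train, base=None):
    base = base or RandomForestClassifier(random_state=0)
    # Inner-fold CV scores per harmonized version keep the stack model unbiased.
    Z = np.column_stack([
        cross_val_predict(clone(base), Xk, y_train, cv=5,
                          method="predict_proba")[:, 1]
        for Xk in X_versions_train
    ])
    stack = LogisticRegression().fit(Z, y_train)          # meta-learner on K scores
    bases = [clone(base).fit(Xk, y_train) for Xk in X_versions_train]
    return bases, stack

def predict_pretty_stack(bases, stack, X_versions_test):
    # True test labels never appear; only the K pretended versions are scored.
    Z = np.column_stack([m.predict_proba(Xk)[:, 1]
                         for m, Xk in zip(bases, X_versions_test)])
    return stack.predict(Z), Z                            # Z has shape (n_test, K)
```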
PrettYharmonize Algorithm Steps
| Step | Description | Key Inputs |
|---|---|---|
| Data Split | Partition into outer train/test folds (CV) | X, Y (train), S |
| Inner-fold Stacking | Train harmonization models, base predictors, and stack model | X_train, Y_train, S_train |
| Pretended Label Harmonize | For each y*, harmonize test set using only training-fitted parameters | X_test, S_test, pretended y* |
| Meta-prediction | Stack-model aggregates scores for final prediction | Z_test (n_test×K) |
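Stitching the sketches above into the outer loop from the table (hypothetical glue code for a binary classification task):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def pretty_harmonize_cv(X, y, s, n_sites, n_splits=5):
    """Outer CV mirroring the table's steps; reuses the helpers sketched above."""
    preds = np.empty(len(y))
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X, y):
        params = fit_harmonizer(X[tr], y[tr], s[tr], n_sites)
        y_stars = pretended_labels(y[tr], "classification")
        X_tr_versions = [harmonize(X[tr], np.full(len(tr), ys), s[tr], params)
                         for ys in y_stars]
        X_te_versions = [harmonize(X[te], np.full(len(te), ys), s[te], params)
                         for ys in y_stars]
        bases, stack = fit_pretty_stack(X_tr_versions, y[tr])
        preds[te], _ = predict_pretty_stack(bases, stack, X_te_versions)
    return preds
```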
3. Experimental Design and Evaluation Protocol
3.1 Controlled Synthetic Benchmarks (MAREoS Datasets)
- Eight datasets (1,000 samples; 8 sites; 18 features).
- “True” datasets: features depend on the target $y$ only.
- “EoS” datasets: features depend on site, which is confounded with the target, but carry no true biological signal (see the sketch after this list).
- Both linear and non-linear generative processes.
- 10 pre-defined folds; “True” sets are designed for a target balanced accuracy of about 80%, “EoS” sets for about 50% (chance).
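To make the “EoS” construction concrete, here is a toy illustration (ours, not the MAREoS generator): site assignment is confounded with the target while the features encode only site information, so a naive model looks predictive even though no true signal exists:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
y = rng.integers(0, 2, n)
# Confounded assignment: class 0 mostly lands on sites 0-3, class 1 on sites 4-7.
s = np.where(rng.random(n) < 0.8,
             4 * y + rng.integers(0, 4, n),
             rng.integers(0, 8, n))
# Features carry site information only -- no direct target signal.
X = rng.normal(0.0, 1.0, (n, 18)) + 0.5 * s[:, None]
# A classifier fit on X can exploit the site-target confound (bACC well above 50%);
# a leakage-free harmonizer should push performance back to chance.
```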
3.2 Real-World Datasets
- MRI (gray matter voxels, 3,747 features):
- Age regression and sex classification across five independent imaging datasets (AOMIC, eNKI, CamCAN, 1000Brains, SALD).
- ADNI classification: dementia vs. MCI vs. control.
- Clinical data (eICU-CRD): arterial blood gases; outcome is hospital discharge (alive/expired).
3.3 Evaluation Metrics and Protocol
- Classification: balanced accuracy (bACC), AUC, F1.
- Regression: MAE, $R^2$, and age bias, defined as corr(true, pred − true); a metric sketch follows this list.
- 5-fold or 7-fold cross-validation.
- Performance reported as mean over folds.
- Paired Wilcoxon tests used for statistical comparison (not detailed in the primary source) (Nieto et al., 2024).
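A sketch of these metrics with scikit-learn, computing age bias exactly as defined above:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, roc_auc_score, f1_score,
                             mean_absolute_error, r2_score)

def classification_metrics(y_true, y_pred, y_score):
    return {"bACC": balanced_accuracy_score(y_true, y_pred),
            "AUC": roc_auc_score(y_true, y_score),
            "F1": f1_score(y_true, y_pred)}

def regression_metrics(y_true, y_pred):
    return {"MAE": mean_absolute_error(y_true, y_pred),
            "R2": r2_score(y_true, y_pred),
            # age bias: corr(true, pred - true)
            "age_bias": np.corrcoef(y_true, y_pred - y_true)[0, 1]}
```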
4. Quantitative Results and Comparative Assessment
4.1 Synthetic Benchmarks
- “True” sets: PrettYharmonize bACC comparable to the random forest baseline (72–83%; difference ≤ 2%).
- “EoS” sets: bACC near chance (52–58%), i.e., 18–30% lower than the baseline, which exploits the confounded site signal.
4.2 MRI Age Regression (Site–Target Dependence)
- Unharmonized MAE: 6.20 years; PrettYharmonize MAE: 4.12; WDH: 3.82; TTL: 4.28; No-Target: 15.93.
- $R^2$: unharmonized: 0.81; PrettYharmonize: 0.919; WDH: 0.925; TTL: 0.912.
Site–Target Independence
- No scheme improves significantly over unharmonized (MAE ≈ 6.31).
4.3 MRI Sex Classification (Dependence)
- Unharmonized AUC: 0.97, bACC: 92.6%.
- PrettYharmonize AUC: 0.968, bACC: 92.2%; WDH/TTL similar; No-Target bACC: 63.1%.
Site–Target Independence
- All methods perform similarly (AUC ≈ 0.92, bACC ≈ 85%).
4.4 ADNI Dementia Classification (Dependence)
- Unharm AUC: 0.813, bACC: 73.7%.
- PrettYharmonize AUC: 0.843, bACC: 77.3%; WDH/TTL similar; No-Target bACC: 60.2%.
Independence
- AUC ≈ 0.71, bACC ≈ 65–66% across schemes.
4.5 eICU-CRD Septic Patient Discharge (Dependence)
- Unharmonized AUC: 0.766; PrettYharmonize: 0.859; WDH: 0.800; TTL: 0.790; No-Target: 0.572.
Independence
- All approaches yield AUC ≈ 0.70–0.72.
5. Mechanisms of Leakage Avoidance and Theoretical Analysis
Traditional ComBat-based harmonization, when preserving biological covariate effects, necessitates inputting the true label $y$ for the test data, constituting data leakage. PrettYharmonize circumvents this by replacing $y$ during test harmonization with multiple pretended label configurations. Each test sample is harmonized $K$ times (once per pretended label $y^*_k$), and the downstream ML model is applied to all versions, producing a score matrix $Z$ of dimension $n_{\text{test}} \times K$.
A meta-learner (the “stack model”) is then trained to convert $Z$ into final predictions $\hat{y}$. If a pretended label $y^*_k$ matches the true label, the harmonized version retains the biological signal, so the base model $M$ assigns a higher-confidence score; if $y^*_k \neq y$, the true signal is distorted, lowering model confidence. Thus, the stack model learns to recognize the harmonization that best preserves predictive utility for each test instance.
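For instance, in a three-class task a test sample's row of $Z$ might read $(0.91, 0.34, 0.28)$: the base model is most confident under the first pretended label, and the stack model learns to map such confidence patterns to the corresponding class.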
A key theoretical insight is that PrettYharmonize returns bACC ≈ 50% on synthetic “EoS”-only datasets, indicating that it does not covertly exploit site–label confounding in the absence of real biological signal. On site–target dependent tasks, it matches leakage-prone schemes (WDH, TTL) in predictive performance, demonstrating robust leakage resistance (Nieto et al., 2024).
6. Deployment and Implications for Biomedical ML
PrettYharmonize requires only the site label (and not the true target) at prediction time. It enables leakage-free harmonization even in the presence of site-label confounds and class imbalance across sites, unlike classical ComBat-based methods that are susceptible to label leakage. This makes PrettYharmonize directly deployable in clinical or prospective ML settings.
A plausible implication is that PrettYharmonize provides a general leakage-avoidance template for harmonization of high-dimensional, site-heterogeneous biomedical data, where ground-truth labels for future or external cohorts are unavailable at the time of harmonization.
For further details, including algorithmic pseudocode and data descriptions, see (Nieto et al., 2024).