
PrettYharmonize: Leakage-Free Data Harmonization

Updated 5 February 2026
  • PrettYharmonize is an algorithmic framework for leakage-free data harmonization that removes site-specific variability without using true test-set labels.
  • It employs a pretended label strategy and meta-learning stacking to integrate predictions, ensuring preservation of biological signal in high-dimensional biomedical data.
  • Validation on synthetic and real-world datasets demonstrates its competitive performance in mitigating data leakage while maintaining target-related signals.

PrettYharmonize is an algorithmic framework for data harmonization in multi-site ML pipelines, specifically designed to address data leakage associated with traditional ComBat-based methods. It enables the removal of site-specific variability (effects of site, EoS) from high-dimensional biomedical data (such as MRI and clinical measurements) while rigorously avoiding the use of test-set labels during harmonization. PrettYharmonize achieves this by generating “pretended” labels for test data and deploying a meta-learning (stacking) strategy to integrate predictions, thereby preserving biological signal relevant to the target variable without introducing leakage. Its design is validated on both controlled synthetic datasets (benchmarking leakage and harmonization performance) and a range of real-world clinical and imaging data, demonstrating competitive performance in site–target dependent scenarios without overfitting to site confounds (Nieto et al., 2024).

1. Motivation and Problem Setting

The harmonization of multi-site data is foundational in biomedical ML due to inter-site variation stemming from heterogeneous scanners, acquisition protocols, and laboratory environments. When removing EoS, the central challenge is maintaining predictive information on the biological target without introducing data leakage. Data leakage, in this context, is defined as any process during harmonization whereby knowledge of test-set labels (the ground-truth target variable $y_{test}$) is utilized, whether directly or implicitly. Standard ComBat harmonization typically preserves biological covariates by conditioning harmonization on $y$, requiring $y_{test}$ if performed on held-out or prospective data. This leads to over-optimistic generalization and is incompatible with real-world deployment where labels are unavailable at prediction time.

PrettYharmonize is explicitly designed to answer the following: How can ComBat-style harmonization be integrated into ML pipelines such that:

  • (a) all site effects (EoS) are effectively removed,
  • (b) the target-related signal is preserved,
  • (c) no use of true test labels is made at any harmonization step?

2. Mathematical and Algorithmic Framework

2.1 Generative Model and Notation

The data generative process is modeled as:

$$x_{ij} = \mu_i + f_i(y_j) + \alpha_{i,s(j)} + \delta_{i,s(j)}\,\epsilon_{ij}$$

where:

  • $x_{ij}$: value of feature $i$ for sample $j$
  • $y_j$: biological target (classification or regression)
  • $s(j)$: site of sample $j$
  • $\mu_i$: global intercept
  • $f_i(y_j)$: target (biological) effect (e.g., linear or spline)
  • $\alpha_{i,s(j)}$: additive site effect
  • $\delta_{i,s(j)}$: multiplicative site-specific scaling
  • $\epsilon_{ij}$: residual noise
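The generative process above can be simulated directly. A minimal sketch, assuming a linear target effect $f_i(y) = \beta_i y$; the parameter scales below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_sites = 1000, 18, 8   # mirrors the MAREoS layout

y = rng.integers(0, 2, size=n_samples)             # binary biological target y_j
site = rng.integers(0, n_sites, size=n_samples)    # site assignment s(j)

mu = rng.normal(0.0, 1.0, size=n_features)                 # global intercepts mu_i
beta = rng.normal(0.5, 0.1, size=n_features)               # linear target effects
alpha = rng.normal(0.0, 0.8, size=(n_features, n_sites))   # additive site effects
delta = rng.uniform(0.5, 1.5, size=(n_features, n_sites))  # site-specific scalings

eps = rng.normal(size=(n_samples, n_features))             # residual noise eps_{ij}
X = mu + beta * y[:, None] + alpha[:, site].T + delta[:, site].T * eps
```

Because the site effects are drawn independently of $y$, this corresponds to a site–target independent scenario; making `site` depend on `y` would produce the confounded ("EoS") setting.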

2.2 Pretended Label Strategy

Instead of supplying true $y_{test}$ for test samples during harmonization, PrettYharmonize constructs $K$ “pretended” labels $y^*$ for each test sample. In classification, $K$ equals the number of classes and each pretended label sets $y^*$ to one of the classes. In regression, the $K$ pretended labels evenly tile the observed target range $[y_{min}, y_{max}]$.
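A minimal sketch of the pretended-label construction; the helper name and signature are hypothetical, not the package API:

```python
import numpy as np

def pretended_labels(y_train, task="classification", K=5):
    """Return the pretended labels y* assigned in turn to every test sample."""
    if task == "classification":
        # One pretended label per class observed in training: K = n_classes
        return np.unique(y_train)
    # Regression: K values evenly tiling the observed training-target range
    return np.linspace(np.min(y_train), np.max(y_train), K)
```

For a three-class problem this yields three labels, so the test set is harmonized three times, once under each class assumption.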

2.3 Harmonization Transformation

For each pretended label $y^*$, harmonization parameters $\hat{\theta} = (\hat{\mu}_i, \hat{f}_i, \hat{\alpha}_{i,s}, \hat{\delta}_{i,s})$ are estimated using only the training set, minimizing the residual fit to the generative model:

$$\sum_{j \in \text{train}} \sum_{i} \left( x_{ij} - \mu_i - f_i(y_j) - \alpha_{i,s(j)} \right)^2$$

The resulting harmonized features for sample $j$ under label $y^*$:

$$\tilde{x}_{ij}(y^*) = \frac{x_{ij} - \hat{\mu}_i - \hat{f}_i(y^*) - \hat{\alpha}_{i,s(j)}}{\hat{\delta}_{i,s(j)}} + \hat{\mu}_i + \hat{f}_i(y^*)$$

Each sample thus yields $K$ harmonized versions.
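The fit-then-transform steps can be sketched as a ComBat-style location/scale harmonizer consistent with the generative model in Section 2.1. This is a sketch, not the neuroCombat API: `fit_harmonizer` and `harmonize` are illustrative names, the linear target effect is an assumption, and residuals are rescaled to unit (rather than pooled) variance for simplicity.

```python
import numpy as np

def fit_harmonizer(X, y, site):
    """Estimate mu_i, beta_i, alpha_{i,s}, delta_{i,s} from training data only."""
    mu = X.mean(axis=0)
    yc = y - y.mean()
    beta = (X - mu).T @ yc / (yc @ yc)          # per-feature least squares
    resid = X - mu - np.outer(yc, beta)
    sites = np.unique(site)
    alpha = np.stack([resid[site == s].mean(axis=0) for s in sites])
    delta = np.stack([resid[site == s].std(axis=0) + 1e-8 for s in sites])
    return dict(mu=mu, beta=beta, y_mean=y.mean(),
                alpha=alpha, delta=delta, sites=sites)

def harmonize(X, y_star, site, p):
    """Remove site effects while preserving the (possibly pretended) target effect."""
    y_star = np.broadcast_to(np.asarray(y_star, float), (len(X),))
    idx = np.searchsorted(p["sites"], site)
    target = np.outer(y_star - p["y_mean"], p["beta"])
    standardized = (X - p["mu"] - target - p["alpha"][idx]) / p["delta"][idx]
    return p["mu"] + target + standardized
```

At test time, `harmonize` is called once per pretended label, with parameters fitted exclusively on the training fold.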

2.4 Meta-Learning (Stacking) and Prediction

A stack model is trained using inner-fold cross-validation, ingesting the $K$ prediction scores generated by a base model $M$ applied to each harmonized version of the validation set. At test time, the stack model combines the $K$ scores for each sample to produce the final prediction. Critically, no step provides $y_{test}$ to the harmonization model.
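The stacking step can be sketched as follows. `score_matrix` and `stack_predict` are hypothetical helpers, and in the actual pipeline the base and stack models are trained with inner-fold cross-validation so the stack model never sees scores from data the base model was fit on:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def score_matrix(base_model, X_versions):
    """Z: one column of base-model scores per pretended-label version."""
    return np.column_stack([base_model.predict_proba(Xv)[:, 1]
                            for Xv in X_versions])

def stack_predict(base_model, stack_model, X_test_versions):
    """Combine the K per-version scores into a final prediction."""
    return stack_model.predict(score_matrix(base_model, X_test_versions))
```

The score matrix `Z` has shape (n_test, K); the meta-learner learns which harmonized version yields the most confident, signal-preserving scores.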

PrettYharmonize Algorithm Steps

| Step | Description | Key Inputs |
| --- | --- | --- |
| Data split | Partition into outer train/test folds (CV) | X, Y (train), S |
| Inner-fold stacking | Train harmonization models, base predictors, and stack model | X_train, Y_train, S_train |
| Pretended-label harmonization | For each y*, harmonize the test set using only training-fitted parameters | X_test, S_test, pretended y* |
| Meta-prediction | Stack model aggregates scores into the final prediction | Z_test (n_test × K) |

3. Experimental Design and Evaluation Protocol

3.1 Controlled Synthetic Benchmarks (MAREoS Datasets)

  • Eight datasets (1,000 samples; 8 sites; 18 features).
  • “True” datasets: features depend on the target $y$ only.
  • “EoS” datasets: features depend on site (spuriously associated with the target) and carry no true biological signal.
  • Both linear and non-linear generative processes.
  • 10 pre-defined folds; “True” sets target balanced accuracy ≈80%; “EoS” sets ≈50%.

3.2 Real-World Datasets

  • MRI (gray matter voxels, 3,747 features):
    • Age regression and sex classification across five independent imaging datasets (AOMIC, eNKI, CamCAN, 1000Brains, SALD).
    • ADNI classification: dementia vs. MCI vs. control.
  • Clinical data (eICU-CRD): arterial blood gases; outcome is hospital discharge (alive/expired).

3.3 Evaluation Metrics and Protocol

  • Classification: balanced accuracy (bACC), AUC, F1.
  • Regression: MAE, $R^2$, age bias (corr(true age, predicted − true)).
  • 5-fold or 10-fold cross-validation.
  • Performance reported as mean over folds.
  • Paired Wilcoxon tests used for statistical comparison (not detailed in the primary source) (Nieto et al., 2024).
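The age-bias metric listed above can be computed as a simple correlation; a minimal sketch, assuming “pred − true” denotes the signed prediction error:

```python
import numpy as np

def age_bias(y_true, y_pred):
    """Correlation between true age and prediction error (predicted - true).

    A value near 0 means errors are not systematically related to age;
    negative values indicate regression toward the mean (older subjects
    under-predicted, younger subjects over-predicted).
    """
    return float(np.corrcoef(y_true, y_pred - y_true)[0, 1])
```

For example, a model that regresses predictions toward the cohort mean (say, `y_pred = 0.5 * y_true + 20`) has an age bias of −1.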

4. Quantitative Results and Comparative Assessment

4.1 Synthetic Benchmarks

  • “True” sets: PrettYharmonize bACC comparable to random forest baseline (72–83%; difference ≤2%).
  • “EoS” sets: bACC ≈ chance (52–58%), 18–30% lower than the baseline, which exploits the confounded site signal.

4.2 MRI Age Regression (Site–Target Dependence)

  • Unharmonized MAE: 6.20 years; PrettYharmonize MAE: 4.12; WDH: 3.82; TTL: 4.28; No-Target: 15.93.
  • $R^2$: unharmonized: 0.81; PrettY: 0.919; WDH: 0.925; TTL: 0.912.

Site–Target Independence

  • No scheme improves significantly over unharmonized (MAE ≈ 6.31).

4.3 MRI Sex Classification (Dependence)

  • Unharmonized AUC: 0.97, bACC: 92.6%.
  • PrettYharmonize AUC: 0.968, bACC: 92.2%; WDH/TTL similar; No-Target bACC: 63.1%.

Site–Target Independence

  • All methods perform similarly (AUC ≈ 0.92, bACC ≈ 85%).

4.4 ADNI Dementia Classification (Dependence)

  • Unharm AUC: 0.813, bACC: 73.7%.
  • PrettYharmonize AUC: 0.843, bACC: 77.3%; WDH/TTL similar; No-Target bACC: 60.2%.

Independence

  • AUC ≈ 0.71, bACC ≈ 65–66% across schemes.

4.5 eICU-CRD Septic Patient Discharge (Dependence)

  • Unharmonized AUC: 0.766; PrettYharmonize: 0.859; WDH: 0.800; TTL: 0.790; No-Target: 0.572.

Independence

  • All approaches yield AUC ≈ 0.70–0.72.

5. Mechanisms of Leakage Avoidance and Theoretical Analysis

Traditional ComBat-based harmonization, when preserving biological covariate effects, necessitates inputting the true $y_{test}$ for the test data, which constitutes data leakage. PrettYharmonize circumvents this by replacing $y_{test}$ during test harmonization with multiple pretended label configurations. Each test sample is harmonized $K$ times (once per $y^*$), and the downstream ML model is applied to all versions, producing a score matrix $Z$ of dimension $n_{test} \times K$.

A meta-learner (the “stack model”) is then trained to convert $Z$ into final predictions $\hat{y}$. If the pretended label $y^*$ matches the true label, the harmonized version retains the biological signal, so the base model $M$ assigns a higher-confidence score. If $y^* \neq y_{true}$, the true signal is distorted, lowering model confidence. Thus, the stack model learns to recognize the harmonization that best preserves predictive utility for each test instance.

A key theoretical insight is that PrettYharmonize returns bACC ≈ 50% on synthetic “EoS”-only datasets, indicating that it does not covertly exploit site-label confounding in the absence of real biological signal. On site–target dependent tasks, it matches leakage-prone schemes (WDH, TTL) in predictive performance, demonstrating robust leakage resistance (Nieto et al., 2024).

6. Deployment and Implications for Biomedical ML

PrettYharmonize requires only the site label (and not the true target) at prediction time. It enables leakage-free harmonization even in the presence of site-label confounds and class imbalance across sites, unlike classical ComBat-based methods that are susceptible to label leakage. This makes PrettYharmonize directly deployable in clinical or prospective ML settings.

A plausible implication is that PrettYharmonize provides a general leakage-avoidance template for harmonization of high-dimensional, site-heterogeneous biomedical data, where ground-truth labels for future or external cohorts are unavailable at the time of harmonization.


For further details, including algorithmic pseudocode and data descriptions, see (Nieto et al., 2024).

References

  • Nieto et al. (2024).
