PrettYharmonize: Leakage-Free Data Harmonization
- PrettYharmonize is an algorithmic framework for leakage-free data harmonization that removes site-specific variability without using true test-set labels.
- It employs a pretended-label strategy and meta-learning (stacking) to integrate predictions, preserving biological signal in high-dimensional biomedical data.
- Validation on synthetic and real-world datasets demonstrates its competitive performance in mitigating data leakage while maintaining target-related signals.
PrettYharmonize is an algorithmic framework for data harmonization in multi-site ML pipelines, specifically designed to address data leakage associated with traditional ComBat-based methods. It enables the removal of site-specific variability (effects of site, EoS) from high-dimensional biomedical data (such as MRI and clinical measurements) while rigorously avoiding the use of test-set labels during harmonization. PrettYharmonize achieves this by generating “pretended” labels for test data and deploying a meta-learning (stacking) strategy to integrate predictions, thereby preserving biological signal relevant to the target variable without introducing leakage. Its design is validated on both controlled synthetic datasets (benchmarking leakage and harmonization performance) and a range of real-world clinical and imaging data, demonstrating competitive performance in site–target dependent scenarios without overfitting to site confounds (Nieto et al., 2024).
1. Motivation and Problem Setting
The harmonization of multi-site data is foundational in biomedical ML due to inter-site variation stemming from heterogeneous scanners, acquisition protocols, and laboratory environments. When removing EoS, the central challenge is maintaining predictive information on the biological target without introducing data leakage. Data leakage, in this context, is defined as any process during harmonization whereby knowledge of test-set labels (the ground-truth target variable $y$) is utilized, whether directly or implicitly. Standard ComBat harmonization typically preserves biological covariates by conditioning harmonization on $y$, requiring $y$ even when applied to held-out or prospective data. This leads to over-optimistic generalization and is incompatible with real-world deployment, where labels are unavailable at prediction time.
PrettYharmonize is explicitly designed to answer the following: How can ComBat-style harmonization be integrated into ML pipelines such that:
- (a) all site effects (EoS) are effectively removed,
- (b) the target-related signal is preserved,
- (c) no use of true test labels is made at any harmonization step?
2. Mathematical and Algorithmic Framework
2.1 Generative Model and Notation
The data generative process is modeled as:

$$x_{iv} = \alpha_v + f_v(y_i) + \gamma_{s_i v} + \delta_{s_i v}\,\varepsilon_{iv}$$

where:
- $x_{iv}$: value of feature $v$ for sample $i$
- $y_i$: biological target (classification or regression)
- $s_i$: site of sample $i$
- $\alpha_v$: global intercept
- $f_v(y_i)$: target (biological) effect (e.g., linear or spline)
- $\gamma_{s_i v}$: additive site effect
- $\delta_{s_i v}$: multiplicative site-specific scaling
- $\varepsilon_{iv}$: residual noise
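For concreteness, a minimal numpy sketch of this generative process, assuming a linear target effect $f_v(y) = \beta_v y$ and hypothetical parameter scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_feat, n_sites = 1000, 18, 8

y = rng.integers(0, 2, size=n)            # binary biological target y_i
s = rng.integers(0, n_sites, size=n)      # site assignment s_i

alpha = rng.normal(0.0, 1.0, n_feat)              # global intercepts alpha_v
beta = rng.normal(0.5, 0.1, n_feat)               # linear target effect f_v(y) = beta_v * y
gamma = rng.normal(0.0, 0.8, (n_sites, n_feat))   # additive site effects gamma_sv
delta = rng.uniform(0.5, 1.5, (n_sites, n_feat))  # multiplicative scalings delta_sv
eps = rng.normal(0.0, 1.0, (n, n_feat))           # residual noise

X = alpha + beta * y[:, None] + gamma[s] + delta[s] * eps
```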
2.2 Pretended Label Strategy
Instead of supplying the true label $y$ for test samples during harmonization, PrettYharmonize constructs $K$ “pretended” labels $y^*_1, \ldots, y^*_K$ for each test sample. In classification, $K$ equals the number of classes and each pretended label sets $y^*$ to one of the classes. In regression, the $K$ pretended labels evenly tile the observed target range $[y_{\min}, y_{\max}]$.
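A minimal sketch of the pretended-label construction (the uniform grid for regression is our reading of “evenly tiles”):

```python
import numpy as np

def pretended_labels(y_train, task, K=10):
    """Return the K pretended labels y*_1..y*_K assigned to every test sample."""
    if task == "classification":
        return np.unique(y_train)          # one pretended label per class
    # regression: K values evenly tiling the observed training target range
    return np.linspace(y_train.min(), y_train.max(), K)
```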
2.3 Harmonization Transformation
For each pretended label $y^*_k$, harmonization parameters $(\hat{\alpha}_v, \hat{f}_v, \hat{\gamma}_{sv}, \hat{\delta}_{sv})$ are estimated using only the training set, minimizing the squared residuals of the generative model:

$$\min_{\alpha_v,\, f_v,\, \gamma_{sv}} \sum_i \bigl(x_{iv} - \alpha_v - f_v(y_i) - \gamma_{s_i v}\bigr)^2$$

The resulting harmonized features for sample $i$ under pretended label $y^*_k$ follow the ComBat adjustment:

$$\tilde{x}^{(k)}_{iv} = \frac{x_{iv} - \hat{\alpha}_v - \hat{f}_v(y^*_k) - \hat{\gamma}_{s_i v}}{\hat{\delta}_{s_i v}} + \hat{\alpha}_v + \hat{f}_v(y^*_k)$$

Each sample thus yields $K$ harmonized versions.
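The sketch below illustrates this transformation, assuming a linear target effect fitted by ordinary least squares and moment-based site estimates; the published implementation relies on ComBat's empirical-Bayes fitting, so treat this as an approximation:

```python
import numpy as np

def fit_harmonizer(X, y, s, n_sites):
    """Estimate (alpha_v, beta_v, gamma_sv, delta_sv) on TRAINING data only."""
    A = np.column_stack([np.ones(len(y)), y])        # intercept + linear target effect
    coef, *_ = np.linalg.lstsq(A, X, rcond=None)     # shape (2, n_feat)
    resid = X - A @ coef
    gamma = np.stack([resid[s == j].mean(0) for j in range(n_sites)])  # additive EoS
    delta = np.stack([resid[s == j].std(0) for j in range(n_sites)])   # multiplicative EoS
    return coef[0], coef[1], gamma, np.clip(delta, 1e-6, None)

def harmonize(X, y, s, params):
    """ComBat-style adjustment: strip site effects, re-add the target effect.

    At test time, y holds a pretended label y*_k, never the true label.
    """
    alpha, beta, gamma, delta = params
    target = alpha + beta * np.asarray(y, float)[:, None]
    return (X - target - gamma[s]) / delta[s] + target
```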
2.4 Meta-Learning (Stacking) and Prediction
A stack model is trained using inner-fold cross-validation, ingesting the $K$ prediction scores generated by a base model $M$ applied to each harmonized version of the validation set. At test time, the stack model combines the $K$ scores for each sample to produce the final prediction. Critically, no step provides the true label $y$ to the harmonization model.
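A simplified scikit-learn sketch of this stage, assuming binary classification and one base model per pretended-label version (the paper's exact inner-fold layout may differ). `X_versions_train` is the list of $K$ training matrices harmonized under each $y^*_k$:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_pretty_stack(X_versions_train, y_train, base=None):
    base = base or RandomForestClassifier(random_state=0)
    # Inner-fold CV scores per harmonized version keep the stack model unbiased.
    Z = np.column_stack([
        cross_val_predict(clone(base), Xk, y_train, cv=5,
                          method="predict_proba")[:, 1]
        for Xk in X_versions_train
    ])
    stack = LogisticRegression().fit(Z, y_train)          # meta-learner on K scores
    bases = [clone(base).fit(Xk, y_train) for Xk in X_versions_train]
    return bases, stack

def predict_pretty_stack(bases, stack, X_versions_test):
    # True test labels never appear; only the K pretended versions are scored.
    Z = np.column_stack([m.predict_proba(Xk)[:, 1]
                         for m, Xk in zip(bases, X_versions_test)])
    return stack.predict(Z), Z                            # Z has shape (n_test, K)
```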
PrettYharmonize Algorithm Steps
| Step | Description | Key Inputs |
|---|---|---|
| Data Split | Partition into outer train/test folds (CV) | X, Y (train), S |
| Inner-fold Stacking | Train harmonization models, base predictors, and stack model | X_train, Y_train, S_train |
| Pretended Label Harmonize | For each y*, harmonize test set using only training-fitted parameters | X_test, S_test, pretended y* |
| Meta-prediction | Stack-model aggregates scores for final prediction | Z_test (n_test×K) |
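Stitching the sketches above into the outer loop from the table (hypothetical glue code for a binary classification task):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def pretty_harmonize_cv(X, y, s, n_sites, n_splits=5):
    """Outer CV mirroring the table's steps; reuses the helpers sketched above."""
    preds = np.empty(len(y))
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X, y):
        params = fit_harmonizer(X[tr], y[tr], s[tr], n_sites)
        y_stars = pretended_labels(y[tr], "classification")
        X_tr_versions = [harmonize(X[tr], np.full(len(tr), ys), s[tr], params)
                         for ys in y_stars]
        X_te_versions = [harmonize(X[te], np.full(len(te), ys), s[te], params)
                         for ys in y_stars]
        bases, stack = fit_pretty_stack(X_tr_versions, y[tr])
        preds[te], _ = predict_pretty_stack(bases, stack, X_te_versions)
    return preds
```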
3. Experimental Design and Evaluation Protocol
3.1 Controlled Synthetic Benchmarks (MAREoS Datasets)
- Eight datasets (1,000 samples; 8 sites; 18 features).
- “True” datasets: features depend on the target $y$ only.
- “EoS” datasets: features depend on site, which is confounded with the target, but carry no true biological signal (see the sketch after this list).
- Both linear and non-linear generative processes.
- 10 pre-defined folds; “True” sets are designed for a target balanced accuracy of about 80%, “EoS” sets for about 50% (chance).
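To make the “EoS” construction concrete, here is a toy illustration (ours, not the MAREoS generator): site assignment is confounded with the target while the features encode only site information, so a naive model looks predictive even though no true signal exists:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
y = rng.integers(0, 2, n)
# Confounded assignment: class 0 mostly lands on sites 0-3, class 1 on sites 4-7.
s = np.where(rng.random(n) < 0.8,
             4 * y + rng.integers(0, 4, n),
             rng.integers(0, 8, n))
# Features carry site information only -- no direct target signal.
X = rng.normal(0.0, 1.0, (n, 18)) + 0.5 * s[:, None]
# A classifier fit on X can exploit the site-target confound (bACC well above 50%);
# a leakage-free harmonizer should push performance back to chance.
```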
3.2 Real-World Datasets
- MRI (gray matter voxels, 3,747 features):
- Age regression and sex classification across five independent imaging datasets (AOMIC, eNKI, CamCAN, 1000Brains, SALD).
- ADNI classification: dementia vs. MCI vs. control.
- Clinical data (eICU-CRD): arterial blood gases; outcome is hospital discharge (alive/expired).
3.3 Evaluation Metrics and Protocol
- Classification: balanced accuracy (bACC), AUC, F1.
- Regression: MAE, $R^2$, and age bias, defined as corr(true, pred − true); a metric sketch follows this list.
- 5-fold or 7-fold cross-validation.
- Performance reported as mean over folds.
- Paired Wilcoxon tests used for statistical comparison (not detailed in the primary source) (Nieto et al., 2024).
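A sketch of these metrics with scikit-learn, computing age bias exactly as defined above:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, roc_auc_score, f1_score,
                             mean_absolute_error, r2_score)

def classification_metrics(y_true, y_pred, y_score):
    return {"bACC": balanced_accuracy_score(y_true, y_pred),
            "AUC": roc_auc_score(y_true, y_score),
            "F1": f1_score(y_true, y_pred)}

def regression_metrics(y_true, y_pred):
    return {"MAE": mean_absolute_error(y_true, y_pred),
            "R2": r2_score(y_true, y_pred),
            # age bias: corr(true, pred - true)
            "age_bias": np.corrcoef(y_true, y_pred - y_true)[0, 1]}
```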
4. Quantitative Results and Comparative Assessment
4.1 Synthetic Benchmarks
- “True” sets: PrettYharmonize bACC comparable to the random forest baseline (72–83%; difference ≤ 2%).
- “EoS” sets: bACC near chance (52–58%), i.e., 18–30% lower than the baseline, which exploits the confounded site signal.
4.2 MRI Age Regression (Site–Target Dependence)
- Unharmonized MAE: 6.20 years; PrettYharmonize MAE: 4.12; WDH: 3.82; TTL: 4.28; No-Target: 15.93.
- $R^2$: unharmonized: 0.81; PrettYharmonize: 0.919; WDH: 0.925; TTL: 0.912.
Site–Target Independence
- No scheme improves significantly over unharmonized (MAE ≈ 6.31).
4.3 MRI Sex Classification (Dependence)
- Unharmonized AUC: 0.97, bACC: 92.6%.
- PrettYharmonize AUC: 0.968, bACC: 92.2%; WDH/TTL similar; No-Target bACC: 63.1%.
Site–Target Independence
- All methods perform similarly (AUC ≈ 0.92, bACC ≈ 85%).
4.4 ADNI Dementia Classification (Dependence)
- Unharm AUC: 0.813, bACC: 73.7%.
- PrettYharmonize AUC: 0.843, bACC: 77.3%; WDH/TTL similar; No-Target bACC: 60.2%.
Independence
- AUC ≈ 0.71, bACC ≈ 65–66% across schemes.
4.5 eICU-CRD Septic Patient Discharge (Dependence)
- Unharmonized AUC: 0.766; PrettYharmonize: 0.859; WDH: 0.800; TTL: 0.790; No-Target: 0.572.
Independence
- All approaches yield AUC ≈ 0.70–0.72.
5. Mechanisms of Leakage Avoidance and Theoretical Analysis
Traditional ComBat-based harmonization, when preserving biological covariate effects, necessitates inputting the true label $y$ for the test data, constituting data leakage. PrettYharmonize circumvents this by replacing $y$ during test harmonization with multiple pretended label configurations. Each test sample is harmonized $K$ times (once per pretended label $y^*_k$), and the downstream ML model is applied to all versions, producing a score matrix $Z$ of dimension $n_{\text{test}} \times K$.
A meta-learner (the “stack model”) is then trained to convert $Z$ into final predictions $\hat{y}$. If a pretended label $y^*_k$ matches the true label, the harmonized version retains the biological signal, so the base model $M$ assigns a higher-confidence score; if $y^*_k \neq y$, the true signal is distorted, lowering model confidence. Thus, the stack model learns to recognize the harmonization that best preserves predictive utility for each test instance.
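For instance, in a three-class task a test sample's row of $Z$ might read $(0.91, 0.34, 0.28)$: the base model is most confident under the first pretended label, and the stack model learns to map such confidence patterns to the corresponding class.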
A key theoretical insight is that PrettYharmonize returns bACC ≈ 50% on synthetic “EoS”-only datasets, indicating that it does not covertly exploit site–label confounding in the absence of real biological signal. On site–target dependent tasks, it matches leakage-prone schemes (WDH, TTL) in predictive performance, demonstrating robust leakage resistance (Nieto et al., 2024).
6. Deployment and Implications for Biomedical ML
PrettYharmonize requires only the site label (and not the true target) at prediction time. It enables leakage-free harmonization even in the presence of site-label confounds and class imbalance across sites, unlike classical ComBat-based methods that are susceptible to label leakage. This makes PrettYharmonize directly deployable in clinical or prospective ML settings.
A plausible implication is that PrettYharmonize provides a general leakage-avoidance template for harmonization of high-dimensional, site-heterogeneous biomedical data, where ground-truth labels for future or external cohorts are unavailable at the time of harmonization.
For further details, including algorithmic pseudocode and data descriptions, see (Nieto et al., 2024).