FairCauseSyn: Causal Fair Synthetic Health Data
- FairCauseSyn is a framework that uses LLMs to generate synthetic health data while explicitly modeling causal fairness.
- It integrates causal constraints and evaluation metrics to mirror the direct, indirect, and spurious effects observed in real datasets.
- The approach employs an iterative prompt-refinement loop, achieving notable bias reductions such as a 71% drop in direct effect bias.
FairCauseSyn is an LLM-augmented synthetic data generation framework designed to improve causal fairness in synthetic tabular health data. It specifically addresses the shortcomings of existing generative models—including generative adversarial network (GAN)-based and naive LLM-based approaches—that do not enforce or evaluate fairness with respect to complex causal structures in sensitive health datasets. By integrating explicit causal constraints and metrics into an LLM-driven synthesis pipeline, FairCauseSyn produces synthetic datasets that closely mirror the fairness properties of the real data, particularly along direct and indirect causal pathways for sensitive attributes such as sex or race (Nagesh et al., 23 Jun 2025).
1. Motivation and Problem Formulation
Synthetic data generation in healthcare aims to replicate real-world cohorts for analytics and predictive modeling while preserving privacy. Traditional generative approaches often replicate inherent group-level biases. Statistical parity or counterfactual fairness (i.e., fairness under hypothetical feature perturbations) can be insufficient, as they may ignore indirect discrimination propagated via specific mediating variables (e.g., comorbidities) or confounders. FairCauseSyn addresses this by adopting causal fairness, which involves explicit modeling of the underlying data-generating causal graph and decomposition of sensitive attribute influences into direct, indirect, and spurious effects.
Given a real dataset $D_{\text{real}} = \{(s_i, z_i, w_i, y_i)\}_{i=1}^{n}$ with binary sensitive attribute $S$, non-sensitive covariates (confounders $Z$ and mediators $W$), and clinical outcome $Y$, the objective is to generate a synthetic dataset $D_{\text{syn}}$ that preserves both utility and stringent causal fairness metrics (Nagesh et al., 23 Jun 2025).
2. Framework Architecture and Workflow
The FairCauseSyn architecture comprises four modules, orchestrated through a constraint-satisfaction loop:
- Data Preprocessing & Causal Graph Construction: Raw health data undergo imputation, normalization, and encoding. The structural causal model (SCM) is constructed over $S$ (sensitive attribute), $Z$ (confounders), $W$ (mediators), and $Y$ (outcome).
- Causal Fairness Evaluation on Real Data: Baseline causal fairness metrics—total effect (TE), direct effect (DE), indirect effect (IE), and spurious effect (SE)—are estimated on the real cohort using Monte Carlo interventions on the SCM.
- LLM-Augmentation & Synthetic Data Generation: A subset of causally representative data is used to construct prompts that specify schemas and causal constraints. The LLM synthesizes candidate batches, and each is evaluated for adherence to fairness metrics.
- Post-processing & Predictive Modeling: Upon satisfaction of fairness constraints, the synthetic data is processed for downstream predictive modeling tasks.
Workflow summary in compact pseudocode:
```
Input:  D_real, causal graph G = (S→W→Y, S→Y, Z→{W,Y}, Z↔S)
Output: D_syn

 1. Preprocess(D_real) → D_proc
 2. Compute causal metrics (TV, DE, IE, SE) on D_proc
 3. Prompt ← BuildPrompt(D_proc ⊂ examples, schema)
 4. repeat
 5.   D_cand ← QueryLLM(Prompt)
 6.   metrics_cand ← EvalCausal(D_cand, G)
 7.   if MetricsOK(metrics_cand) then
 8.     D_syn ← D_cand; break
 9.   else
10.     Prompt ← RefinePrompt(Prompt, diagnostics)
11. until timeout
12. Postprocess(D_syn)
13. return D_syn
```
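The same loop can be sketched as runnable Python. Here `query_llm`, `eval_causal`, and `refine_prompt` are hypothetical stand-ins for the paper's actual components, and the 0.005 tolerance mirrors the deviation bound reported in the evaluation:

```python
def generation_loop(prompt, real_metrics, query_llm, eval_causal,
                    refine_prompt, eps=0.005, max_iters=20):
    """Generate-evaluate-refine: query the LLM until the candidate batch's
    causal fairness metrics match the real data within tolerance eps."""
    for _ in range(max_iters):
        batch = query_llm(prompt)             # candidate synthetic batch
        metrics = eval_causal(batch)          # e.g. {"TV": ..., "DE": ..., "IE": ...}
        gaps = {k: abs(metrics[k] - real_metrics[k]) for k in real_metrics}
        if all(g <= eps for g in gaps.values()):
            return batch, metrics             # all constraints satisfied
        prompt = refine_prompt(prompt, gaps)  # add exemplars / instructions
    raise TimeoutError("fairness constraints not met within iteration budget")
```

In the paper the refinement step augments the prompt with diagnostics from the failed batch; here it is left abstract as a callback.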
3. Causal Fairness: Definitions and Measurement
The framework employs Pearl’s do-calculus and the Standard Fairness Model for effect decomposition:
- Total Effect (TE): $\mathrm{TE}_{s_0,s_1}(Y) = \mathbb{E}[Y_{s_1}] - \mathbb{E}[Y_{s_0}]$
- Direct Effect (DE) (switching $S$ to $s_1$ while fixing mediators at their baseline under $s_0$): $\mathrm{DE}_{s_0,s_1}(Y) = \mathbb{E}[Y_{s_1, W_{s_0}}] - \mathbb{E}[Y_{s_0}]$
- Indirect Effect (IE) (mediated, fixing $S = s_0$ but switching $W$ as if $S = s_1$): $\mathrm{IE}_{s_0,s_1}(Y) = \mathbb{E}[Y_{s_0, W_{s_1}}] - \mathbb{E}[Y_{s_0}]$
- Spurious Effect (SE) and Total Variation (TV) are further defined (following Plečko et al. 2024): $\mathrm{TV}_{s_0,s_1}(Y) = \mathbb{E}[Y \mid S = s_1] - \mathbb{E}[Y \mid S = s_0]$ and $\mathrm{SE}_{s}(Y) = \mathbb{E}[Y \mid S = s] - \mathbb{E}[Y_{s}]$, the portion of the observed disparity transmitted through the confounders $Z$ rather than along causal paths from $S$.
Estimation is performed using Monte Carlo interventions replicating interventional distributions on both real and synthetic data, with close alignment indicating successful structural and fairness preservation.
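As a concrete illustration of this Monte Carlo procedure, the sketch below estimates TV, TE, DE, IE, and SE on a toy linear SCM with the same graph shape (S→W→Y, S→Y, Z→{S,W,Y}). The structural equations and coefficients are invented for illustration and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy linear SCM: W = a*S + b*Z + u_w,  Y = c*S + d*W + e*Z + u_y
a, b = 1.5, 0.5
c, d, e = 2.0, 1.0, 0.8

# Exogenous variables, shared across all counterfactual worlds
z = rng.normal(size=n)
u_s = rng.uniform(size=n)
u_w = rng.normal(size=n)
u_y = rng.normal(size=n)

def f_s(z, u):
    return (u < 1.0 / (1.0 + np.exp(-z))).astype(float)  # Z confounds S

def f_w(s, z, u):
    return a * s + b * z + u

def f_y(s, w, z, u):
    return c * s + d * w + e * z + u

# Observational world
s = f_s(z, u_s)
w = f_w(s, z, u_w)
y = f_y(s, w, z, u_y)

# Interventional / counterfactual worlds, reusing the same noise terms
w0, w1 = f_w(0.0, z, u_w), f_w(1.0, z, u_w)
y0 = f_y(0.0, w0, z, u_y)     # Y_{s0}
y1 = f_y(1.0, w1, z, u_y)     # Y_{s1}
y1_w0 = f_y(1.0, w0, z, u_y)  # Y_{s1, W_{s0}}: direct path only
y0_w1 = f_y(0.0, w1, z, u_y)  # Y_{s0, W_{s1}}: mediated path only

TV = y[s == 1].mean() - y[s == 0].mean()  # observational contrast
TE = (y1 - y0).mean()                     # total effect = c + d*a = 3.5
DE = (y1_w0 - y0).mean()                  # direct effect = c = 2.0
IE = (y0_w1 - y0).mean()                  # indirect effect = d*a = 1.5
SE = y[s == 1].mean() - y1.mean()         # spurious component via Z (> 0 here)
```

Because $Z$ raises both $S$ and $Y$, the observational contrast TV exceeds the interventional TE, and SE captures exactly that gap on the $S=1$ side.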
4. LLM-Augmentation Loop, Prompting, and Optimization
Data curation involves selecting a causally representative compact subset of preprocessed records. Prompt engineering encodes both feature schemas and explicit causal fairness constraints.
The prompt-based synthetic data generation is structured as an iterative loop. Each batch generated by the LLM is evaluated for causal metrics (TV, DE, IE) against tight thresholds ($\epsilon_{\mathrm{TV}}$, $\epsilon_{\mathrm{DE}}$, $\epsilon_{\mathrm{IE}}$). If the batch fails, the prompt is refined by adding new exemplars or explicit instructions, or both. The process can be interpreted as implicit optimization over prompt parameters $\theta_p$ to minimize the composite loss:

$$\mathcal{L}_{\mathrm{fair}}(\theta_p) = \lvert \mathrm{TV}_{\mathrm{syn}} - \mathrm{TV}_{\mathrm{real}} \rvert + \lvert \mathrm{DE}_{\mathrm{syn}} - \mathrm{DE}_{\mathrm{real}} \rvert + \lvert \mathrm{IE}_{\mathrm{syn}} - \mathrm{IE}_{\mathrm{real}} \rvert$$
In contrast to model parameter fine-tuning, constraint satisfaction is enforced by prompt refinement rather than gradient-based updates.
5. Model Objective, Data Generation, and Post-processing
The LLM is taken as the core generative model: prompt design is the primary mode of tuning. The joint loss for the pipeline is formulated as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{utility}} + \lambda \, \mathcal{L}_{\mathrm{fair}}$$

with $\lambda$ controlling the trade-off. In practice, this is realized as a sequence of hard-thresholded constraints via the prompt-refinement loop.
6. Empirical Evaluation and Results
Experiments are performed on the Heart Failure Clinical Records dataset (Chicco & Jurman 2020) with 299 records. The sensitive attribute is sex, the confounder is age, the mediators comprise clinical measurements including anaemia, diabetes, high blood pressure, ejection fraction, serum creatinine, serum sodium, platelets, and smoking, and the outcome is survival (0/1). Preprocessing includes imputing missing values, normalization, and one-hot encoding.
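A minimal NumPy sketch of this preprocessing step (mean imputation, z-score normalization, one-hot encoding of integer-coded categoricals); the interface is illustrative, not the paper's code:

```python
import numpy as np

def preprocess(X_cont, X_cat, n_categories):
    """Impute missing continuous values with column means, z-score
    normalize, and one-hot encode integer-coded categorical columns."""
    X = X_cont.astype(float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]                  # mean imputation
    X = (X - X.mean(axis=0)) / X.std(axis=0)         # z-score normalization
    onehots = [np.eye(k)[X_cat[:, j]] for j, k in enumerate(n_categories)]
    return np.hstack([X] + onehots)
```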
Evaluation proceeds at the data and model levels:
- Causal fairness metrics (TV, DE, IE, SE) on real and synthetic data.
- Downstream prediction with and without fairness constraints (Random Forest, FairAdapt debiaser).
Table: Causal Fairness Metrics (Real vs. Synthetic Data)
| Evaluation Setting | TV | DE | IE | SE |
|---|---|---|---|---|
| Data Fairness – Real | –0.0121±0.0537 | –0.0477±0.0026 | –0.0472±0.0068 | 0.0116±0.0556 |
| Data Fairness – Synthetic | –0.0492±0.0571 | –0.0429±0.0043 | –0.0002±0.0072 | 0.0064±0.0580 |
| Fair Model – Real | 0.0248±0.0631 | –0.0070±0.0016 | –0.0538±0.0054 | 0.0219±0.0637 |
| Fair Model – Synthetic | 0.0003±0.0568 | –0.0020±0.0030 | 0.0076±0.0108 | –0.0099±0.0556 |
Key results:
- Synthetic data’s TV, DE, IE remain within 10% deviation of real data (|Δ| < 0.005).
- Training with a causally fair predictor yields a |DE| drop from 0.0070 to 0.0020 (∼71% reduction in bias).
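The reported reduction follows directly from the fair-model DE magnitudes in the table above:

```python
de_real, de_synthetic = 0.0070, 0.0020          # |DE|, "Fair Model" rows
reduction = (de_real - de_synthetic) / de_real  # relative drop in direct-effect bias
print(f"{reduction:.0%}")                       # → 71%
```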
7. Implications, Challenges, and Prospective Directions
FairCauseSyn marks the first integration of causal fairness constraints in LLM-based tabular synthesis for health data. Empirically, it preserves essential causal metrics (TV, DE, IE within 10% error) and, when coupled with a causally fair downstream model, achieves over 70% reduction in direct effect bias relative to real data.
A noted limitation is the higher variance observed in the spurious effect (SE) on synthetic datasets. This suggests prompt-based constraint enforcement alone does not completely eliminate bias propagated through confounders. Future research will likely involve hybrid approaches that jointly leverage SCM parameter learning and prompt optimization to further attenuate spurious pathways, with extensions to encompass multi-group sensitive attributes and continuous interventional settings.
FairCauseSyn demonstrates that LLM-augmented synthesis, when carefully constrained by causal fairness principles, can provide high-utility and low-bias synthetic health data, supporting more robust and equitable downstream healthcare analytics (Nagesh et al., 23 Jun 2025).