
FairCauseSyn: Causal Fair Synthetic Health Data

Updated 25 February 2026
  • FairCauseSyn is a framework that uses LLMs to generate synthetic health data while explicitly modeling causal fairness.
  • It integrates causal constraints and evaluation metrics to mirror the direct, indirect, and spurious effects observed in real datasets.
  • The approach employs an iterative prompt-refinement loop, achieving notable bias reductions such as a 71% drop in direct effect bias.

FairCauseSyn is an LLM-augmented synthetic data generation framework designed to improve causal fairness in synthetic tabular health data. It specifically addresses the shortcomings of existing generative models—including generative adversarial network (GAN)-based and naive LLM-based approaches—that do not enforce or evaluate fairness with respect to complex causal structures in sensitive health datasets. By integrating explicit causal constraints and metrics into an LLM-driven synthesis pipeline, FairCauseSyn produces synthetic datasets that closely mirror the fairness properties of the real data, particularly along direct and indirect causal pathways for sensitive attributes such as sex or race (Nagesh et al., 23 Jun 2025).

1. Motivation and Problem Formulation

Synthetic data generation in healthcare aims to replicate real-world cohorts for analytics and predictive modeling while preserving privacy. Traditional generative approaches often replicate inherent group-level biases. Statistical parity or counterfactual fairness (i.e., fairness under hypothetical feature perturbations) can be insufficient, as they may ignore indirect discrimination propagated via specific mediating variables (e.g., comorbidities) or confounders. FairCauseSyn addresses this by adopting causal fairness, which involves explicit modeling of the underlying data-generating causal graph and decomposition of sensitive attribute influences into direct, indirect, and spurious effects.

Given a real dataset

\mathcal{D}_{\text{real}} = \{(s_i, x_i, y_i)\}_{i=1}^{N},

with binary sensitive attribute S ∈ {s₀, s₁}, non-sensitive covariates X ∈ ℝ^d, and binary clinical outcome Y ∈ {0, 1}, the objective is to generate a synthetic dataset D_syn that preserves both utility and stringent causal fairness metrics (Nagesh et al., 23 Jun 2025).
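The problem setup above can be sketched as a dataset of (s, x, y) triples. The class name and field layout below are illustrative, not from the paper:

```python
# Minimal sketch of the dataset D_real = {(s_i, x_i, y_i)}; the class
# name and fields are illustrative, not from the paper.
from dataclasses import dataclass
import numpy as np

@dataclass
class HealthDataset:
    s: np.ndarray  # binary sensitive attribute S, shape (N,)
    x: np.ndarray  # non-sensitive covariates X, shape (N, d)
    y: np.ndarray  # binary clinical outcome Y, shape (N,)

    def __post_init__(self):
        # Enforce the domains S ∈ {s0, s1} and Y ∈ {0, 1}
        assert set(np.unique(self.s)) <= {0, 1}
        assert set(np.unique(self.y)) <= {0, 1}
        assert self.s.shape[0] == self.x.shape[0] == self.y.shape[0]

rng = np.random.default_rng(0)
N, d = 299, 11  # sizes echoing the heart-failure cohort used later
D_real = HealthDataset(
    s=rng.integers(0, 2, N),
    x=rng.normal(size=(N, d)),
    y=rng.integers(0, 2, N),
)
```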

2. Framework Architecture and Workflow

The FairCauseSyn architecture comprises four modules, orchestrated through a constraint-satisfaction loop:

  1. Data Preprocessing & Causal Graph Construction: Raw health data undergo imputation, normalization, and encoding. The structural causal model (SCM) is constructed over S (sensitive attribute), Z (confounders), W (mediators), and Y (outcome).
  2. Causal Fairness Evaluation on Real Data: Baseline causal fairness metrics—total effect (TE), direct effect (DE), indirect effect (IE), and spurious effect (SE)—are estimated on the real cohort using Monte Carlo interventions on the SCM.
  3. LLM-Augmentation & Synthetic Data Generation: A subset of causally representative data is used to construct prompts that specify schemas and causal constraints. The LLM synthesizes candidate batches, and each is evaluated for adherence to fairness metrics.
  4. Post-processing & Predictive Modeling: Upon satisfaction of fairness constraints, the synthetic data is processed for downstream predictive modeling tasks.

Workflow summary in compact pseudocode:

Input:  D_real, causal graph G = (S→W→Y, S→Y, Z→{W,Y}, Z→S)
Output: D_syn

1.  Preprocess(D_real) → D_proc
2.  Compute causal metrics (TV, DE, IE, SE) on D_proc
3.  Prompt ← BuildPrompt(examples from D_proc, schema)
4.  repeat
5.    D_cand ← QueryLLM(Prompt)
6.    metrics_cand ← EvalCausal(D_cand, G)
7.    if MetricsOK(metrics_cand) then
8.       D_syn ← D_cand; break
9.    else
10.      Prompt ← RefinePrompt(Prompt, diagnostics)
11. until timeout
12. Postprocess(D_syn)
13. return D_syn
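The loop above can be made concrete with a small runnable sketch. QueryLLM, EvalCausal, and RefinePrompt are stand-ins here: `query_llm` fakes an LLM whose candidate batches get fairer as refinement instructions accumulate in the prompt, and the thresholds are illustrative values, not the paper's.

```python
# Runnable sketch of the constraint-satisfaction loop; all thresholds
# and the fake LLM response model are illustrative assumptions.
EPS = {"DE": 0.05, "IE": 0.05, "SE": 0.05}  # illustrative thresholds

def query_llm(prompt):
    # Stand-in for QueryLLM + EvalCausal: returns candidate metrics
    # that shrink as the prompt gains refinement instructions.
    k = len(prompt["constraints"])
    return {"DE": 0.20 / (1 + k), "IE": 0.10 / (1 + k), "SE": 0.05 / (1 + k)}

def metrics_ok(m):
    return all(abs(m[name]) <= eps for name, eps in EPS.items())

def refine_prompt(prompt, metrics):
    # Add an instruction targeting the worst-violating metric
    worst = max(metrics, key=lambda name: abs(metrics[name]) - EPS[name])
    prompt["constraints"].append(f"reduce |{worst}| below {EPS[worst]}")
    return prompt

def generate(max_iters=10):
    prompt = {"schema": "...", "constraints": []}
    for _ in range(max_iters):
        cand_metrics = query_llm(prompt)
        if metrics_ok(cand_metrics):
            return cand_metrics      # accept the candidate batch
        prompt = refine_prompt(prompt, cand_metrics)
    return None                      # timeout

result = generate()
```

After a few refinement rounds the fake candidate passes every threshold and the loop terminates, mirroring steps 4–11 of the pseudocode.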

3. Causal Fairness: Definitions and Measurement

The framework employs Pearl’s do-calculus and the Standard Fairness Model for effect decomposition:

  • Total Effect (TE):

\mathrm{TE} = P(Y \mid do(S = s_1)) - P(Y \mid do(S = s_0))

  • Direct Effect (DE) (fixing mediators W at their baseline value under s₀):

\mathrm{DE} = P(Y \mid do(S = s_1, W = w_{s_0})) - P(Y \mid do(S = s_0, W = w_{s_0}))

  • Indirect Effect (IE) (mediated: S fixed at s₀ while W switches to its value under S = s₁):

\mathrm{IE} = P(Y \mid do(S = s_0, W = w_{s_1})) - P(Y \mid do(S = s_0, W = w_{s_0}))

  • Spurious Effect (SE) and Total Variation (TV) are further defined (following Plečko et al. 2024):

\mathrm{SE}_{x_0, x_1}(y) = P(y_{x_0} \mid x_1) - P(y \mid x_0), \qquad \mathrm{TV}_{x_0, x_1} = \mathrm{DE} - \mathrm{IE} - \mathrm{SE}

Estimation is performed by Monte Carlo sampling from the interventional distributions on both real and synthetic data; close alignment between the two indicates successful structural and fairness preservation.
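As a hedged sketch of this Monte Carlo estimation, TE, DE, and IE can be computed by intervening on a toy SCM over S, Z, W, Y. The structural equations and coefficients below are illustrative assumptions, not the paper's:

```python
# Monte Carlo estimation of TE, DE, IE on a toy SCM; the equations and
# coefficients are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def simulate(n, s_for_w, s_for_y):
    # Draw from the SCM with S set separately in the mediator equation (W)
    # and the outcome equation (Y), as the nested do() interventions for
    # DE and IE require.
    z = rng.normal(size=n)                                   # confounder Z
    w = 0.8 * s_for_w + 0.5 * z + 0.1 * rng.normal(size=n)   # mediator W
    p_y = sigmoid(0.6 * s_for_y + 1.0 * w + 0.4 * z - 1.0)
    return (rng.random(n) < p_y).mean()                      # P(Y = 1)

n = 200_000
te = simulate(n, 1, 1) - simulate(n, 0, 0)   # do(S=s1) vs do(S=s0)
de = simulate(n, 0, 1) - simulate(n, 0, 0)   # W held at its s0 value
ie = simulate(n, 1, 0) - simulate(n, 0, 0)   # S fixed, W switched to s1
```

In this toy model TE approximately decomposes into DE + IE, which is the sanity check one would run when comparing effect estimates across real and synthetic cohorts.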

4. LLM-Augmentation Loop, Prompting, and Optimization

Data curation involves selecting a causally representative compact subset of preprocessed records. Prompt engineering encodes both feature schemas and explicit causal fairness constraints.

The prompt-based synthetic data generation is structured as an iterative loop. Each batch generated by the LLM is evaluated for causal metrics (DE, IE, SE) against tight thresholds (ε_DE, ε_IE, ε_SE). If the batch fails, the prompt is refined by adding new exemplars or explicit instructions, or both. The process can be interpreted as implicit prompt-parameter optimization (φ) to minimize the composite loss:

\mathcal{L}(\phi) = -\sum_{x \in \mathcal{D}_{\text{real}}} \log P_{\text{LLM}_\phi}(x) + \lambda \bigl[\, |DE_{\text{syn}}| + |IE_{\text{syn}}| + |SE_{\text{syn}}| \,\bigr]

In contrast to model parameter fine-tuning, constraint satisfaction is enforced by prompt refreshment rather than gradient-based updates.

5. Model Objective, Data Generation, and Post-processing

The LLM is taken as the core generative model: prompt design is the primary mode of tuning. The joint loss for the pipeline is formulated as:

\mathcal{L}(\theta) = \underbrace{\mathbb{E}_{x \sim \mathcal{D}_{\text{proc}}}\bigl[-\log P_\theta(x)\bigr]}_{\mathcal{L}_{\text{gen}}(\theta)} + \lambda \underbrace{\bigl(|TE_{\text{syn}} - TE_{\text{real}}| + |DE_{\text{syn}}| + |IE_{\text{syn}}| + |SE_{\text{syn}}|\bigr)}_{\mathcal{L}_{\text{fair}}(\theta)}

with λ > 0 controlling the trade-off. In practice, this is realized as a sequence of hard-thresholded constraints via the prompt-refinement loop.
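The relationship between the soft penalty and its hard-thresholded realization can be sketched as follows; `lam`, `eps`, and the synthetic metric values are illustrative, while the real-cohort values echo the results table below:

```python
# Soft composite penalty L_fair versus the hard-thresholded check the
# loop actually enforces; lam, eps, and m_syn values are illustrative.
def fairness_penalty(m_syn, m_real, lam=1.0):
    # Soft penalty L_fair(theta) from the joint loss
    return lam * (abs(m_syn["TE"] - m_real["TE"])
                  + abs(m_syn["DE"]) + abs(m_syn["IE"]) + abs(m_syn["SE"]))

def constraints_ok(m_syn, m_real, eps=0.01):
    # Hard-thresholded realization used by the prompt-refinement loop
    return (abs(m_syn["TE"] - m_real["TE"]) <= eps
            and all(abs(m_syn[k]) <= eps for k in ("DE", "IE", "SE")))

m_real = {"TE": -0.0121, "DE": -0.0477, "IE": -0.0472, "SE": 0.0116}
m_syn = {"TE": -0.0100, "DE": -0.0050, "IE": 0.0020, "SE": 0.0060}
penalty = fairness_penalty(m_syn, m_real)
```

A candidate batch is accepted exactly when every per-metric constraint holds, regardless of how small the soft penalty already is.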

6. Empirical Evaluation and Results

Experiments are performed on the Heart Failure Clinical Records dataset (Chicco & Jurman 2020) with N = 299 records. The sensitive attribute S is sex, the confounder Z is age, the mediators W comprise clinical measurements (anaemia, diabetes, high blood pressure, ejection fraction, serum creatinine, serum sodium, platelets, smoking), and the outcome Y is survival (0/1). Preprocessing includes imputing missing values, normalization, and one-hot encoding.
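The named preprocessing steps can be sketched on toy columns; the values and column choices below are illustrative, not the actual cohort:

```python
# Imputation, normalization, and one-hot encoding on toy columns;
# values are illustrative, not the actual heart-failure data.
import numpy as np

ef = np.array([38.0, np.nan, 65.0, 50.0, np.nan])   # e.g. ejection fraction
ef = np.where(np.isnan(ef), np.nanmedian(ef), ef)   # median imputation
ef = (ef - ef.mean()) / ef.std()                    # z-score normalization

sex = np.array([0, 1, 1, 0, 1])                     # binary attribute S
sex_onehot = np.eye(2)[sex]                         # one-hot encoding
```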

Evaluation proceeds at the data and model levels:

  1. Causal fairness metrics (TV, DE, IE, SE) on real and synthetic data.
  2. Downstream prediction with and without fairness constraints (Random Forest, FairAdapt debiaser).

Table: Causal Fairness Metrics (Real vs. Synthetic Data)

| Evaluation Setting        | TV             | DE             | IE             | SE             |
|---------------------------|----------------|----------------|----------------|----------------|
| Data Fairness – Real      | –0.0121±0.0537 | –0.0477±0.0026 | –0.0472±0.0068 | 0.0116±0.0556  |
| Data Fairness – Synthetic | –0.0492±0.0571 | –0.0429±0.0043 | –0.0002±0.0072 | 0.0064±0.0580  |
| Fair Model – Real         | 0.0248±0.0631  | –0.0070±0.0016 | –0.0538±0.0054 | 0.0219±0.0637  |
| Fair Model – Synthetic    | 0.0003±0.0568  | –0.0020±0.0030 | 0.0076±0.0108  | –0.0099±0.0556 |

Key results:

  • Synthetic data’s TV, DE, IE remain within 10% deviation of real data (|Δ| < 0.005).
  • Training with a causally fair predictor yields a |DE| drop from 0.0070 to 0.0020 (∼71% reduction in bias).
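The arithmetic behind the ~71% figure is a simple relative reduction in |DE| between the two fair-model rows of the table:

```python
# Relative reduction in |DE| from the real to the synthetic cohort
# under the fair model, per the table above.
de_real = 0.0070
de_syn = 0.0020
reduction = 1.0 - abs(de_syn) / abs(de_real)
print(f"{reduction:.0%}")  # prints 71%
```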

7. Implications, Challenges, and Prospective Directions

FairCauseSyn marks the first integration of causal fairness constraints in LLM-based tabular synthesis for health data. Empirically, it preserves essential causal metrics (TV, DE, IE within 10% error) and, when coupled with a causally fair downstream model, achieves over 70% reduction in direct effect bias relative to real data.

A noted limitation is the higher variance observed in the spurious effect (SE) on synthetic datasets. This suggests prompt-based constraint enforcement alone does not completely eliminate bias propagated through confounders. Future research will likely involve hybrid approaches that jointly leverage SCM parameter learning and prompt optimization to further attenuate spurious pathways, with extensions to encompass multi-group sensitive attributes and continuous interventional settings.

FairCauseSyn demonstrates that LLM-augmented synthesis, when carefully constrained by causal fairness principles, can provide high-utility and low-bias synthetic health data, supporting more robust and equitable downstream healthcare analytics (Nagesh et al., 23 Jun 2025).
