A-FairCLIP: Aligned Fairness in Vision-Language
- The paper presents an aligned implementation of FairCLIP that uses Sinkhorn distance to balance image-text similarity distributions across sensitive subgroups.
- It integrates a fairness regularizer into the standard CLIP framework without batch-dependent normalizations, ensuring reproducible evaluation in medical zero-shot tasks.
- Empirical results show that, although the regularizer reliably reduces Sinkhorn distances, it does not yield consistent improvements in group fairness or diagnostic performance.
A-FairCLIP refers to an “aligned” implementation of FairCLIP, designed to resolve discrepancies between the theoretical description of FairCLIP’s fairness regularization and its original experimental code for vision-language models. The model and its variants were specifically developed to clarify whether optimal-transport-based fairness objectives, particularly those implemented using the Sinkhorn distance on image-text similarity scores, meaningfully improve fairness and diagnostic performance in zero-shot medical classification tasks such as glaucoma diagnosis.
1. Model Design and Fairness Regularization
A-FairCLIP, along with FairCLIP, extends the standard CLIP contrastive learning framework by integrating an additional fairness regularizer. This regularizer aims to align the similarity score distributions between image-text pairs across sensitive subgroups (attributes such as race, gender, ethnicity, and language) with the overall distribution for the population. For a protected attribute with groups $a \in \mathcal{A}$ and a batch of image-text pairs, the regularizer is defined as:

$$\mathcal{L}_{\text{fair}} = \sum_{a \in \mathcal{A}} d_{\epsilon}\big(P, P_a\big)$$

where $d_{\epsilon}$ is the Sinkhorn distance, $P$ is the empirical distribution of diagonal image-text similarities $s_i = \langle v_i, t_i \rangle$ (cosine similarity or dot product for each pair), and $P_a$ is the corresponding distribution restricted to samples in group $a$. The Sinkhorn distance is computed as:

$$d_{\epsilon}(P, Q) = \inf_{\pi \in \Pi(P, Q)} \; \mathbb{E}_{(x, y) \sim \pi}\big[c(x, y)\big] - \epsilon H(\pi)$$

where $c(\cdot, \cdot)$ is a cost function, $-\epsilon H(\pi)$ is the entropy regularizer, and the infimum is taken over all joint distributions $\pi \in \Pi(P, Q)$ with the prescribed marginals $P$ and $Q$.
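As a minimal sketch of this computation (all names and toy data are illustrative, not from the paper), the entropy-regularized transport cost between a subgroup's similarity scores and the overall batch distribution can be computed with plain NumPy, using uniform marginals and a squared cost:

```python
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iters=200):
    """Sinkhorn distance between two 1-D samples of similarity scores:
    the transport cost E_pi[c(x, y)] under the entropically regularized
    optimal plan, with uniform marginals and squared cost."""
    C = (x[:, None] - y[None, :]) ** 2          # cost matrix c(x_i, y_j)
    K = np.exp(-C / eps)                        # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))           # uniform marginal over x
    b = np.full(len(y), 1.0 / len(y))           # uniform marginal over y
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                    # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]            # plan with prescribed marginals
    return float(np.sum(pi * C))                # expected transport cost

rng = np.random.default_rng(0)
overall = rng.normal(0.5, 0.1, 256)   # stand-in for population similarity scores
group = rng.normal(0.6, 0.1, 64)      # stand-in for one subgroup's scores
d = sinkhorn_distance(overall, group)
```

Because the subgroup's scores are shifted relative to the population, the distance is strictly larger than the (entropy-smoothed) self-distance of the population sample.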
The overall objective for A-FairCLIP is:

$$\mathcal{L} = \mathcal{L}_{\text{CLIP}} + \lambda \, \mathcal{L}_{\text{fair}}$$

where $\lambda$ is a tunable regularization hyperparameter. This formulation is designed to strictly follow the original theoretical proposal without extraneous normalizations or batch-dependent score manipulations, as previously observed in the first FairCLIP code releases.
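The combined objective can be sketched as follows. This is a toy NumPy illustration, not the paper's implementation: the fairness term here is a simple stand-in (squared gap between subgroup and overall mean diagonal similarity) in place of the Sinkhorn distance, and all names are hypothetical.

```python
import numpy as np

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(img))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()     # diagonal = matched pairs
    return 0.5 * (xent(logits) + xent(logits.T))

def total_loss(img, txt, groups, lam=0.1):
    """L = L_CLIP + lambda * L_fair.  L_fair is a placeholder for the
    Sinkhorn-based regularizer used in the paper."""
    sims = np.sum(img * txt, axis=1)            # diagonal image-text similarities
    fair = sum((sims[groups == g].mean() - sims.mean()) ** 2
               for g in np.unique(groups))
    return clip_loss(img, txt) + lam * fair
```

Setting `lam=0` recovers the plain fine-tuned CLIP objective, which is the CLIP-FT baseline the paper compares against.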
2. Improvements Over the Original FairCLIP Implementation
A-FairCLIP was developed in response to mismatches between the published FairCLIP description and its implementation. The core design differences are:
- Score Calculation: A-FairCLIP directly utilizes the diagonal similarity scores of the paired image and text embeddings, $s_i = \langle v_i, t_i \rangle$, rather than normalizing scores per sensitive group or using batch-dependent matrix operations.
- Validation Protocol: A-FairCLIP selects model and hyperparameters based on validation performance, not test-set performance, to ensure unbiased evaluation.
- Hyperparameter Optimization: Automated tools (e.g., Optuna) search for optimal learning rates and regularization weights, so the fairness regularizer (via Sinkhorn distance) is correctly balanced against the main CLIP objective.
- Extension to Multiple Attributes: The FairCLIP+ variant combines fairness losses over multiple sensitive attributes using learned, non-negative weights that sum to one.
These changes ensure that the fairness regularizer operates as intended in the theoretical framework and that reported results accurately reflect generalization across groups.
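Two of the design choices above can be sketched in a few lines (a hedged illustration; function names are ours, and the softmax parameterization of the FairCLIP+ weights is one natural way to keep them non-negative and summing to one):

```python
import numpy as np

def diagonal_similarities(img_emb, txt_emb):
    """Paired (diagonal) cosine similarities s_i = <v_i, t_i>,
    with no per-group normalization or batch-dependent rescaling."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return np.sum(img * txt, axis=1)

def combine_attribute_losses(losses, logits):
    """FairCLIP+-style combination of per-attribute fairness losses:
    weights are non-negative and sum to one via a softmax over
    learnable logits (parameterization is illustrative)."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return float(np.dot(w, losses)), w
```

With equal logits the combination reduces to a uniform average over the sensitive attributes.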
3. Experimental Methodology: Zero-Shot Glaucoma Classification
The evaluation was conducted using the Harvard-FairVLMed dataset, which contains paired scanning laser ophthalmoscopy (SLO) fundus images, clinical notes (summarized by GPT-4), and detailed demographic annotations.
- Task Setup: Zero-shot classification is performed by encoding the fundus images and diagnostic text queries (e.g., “A photo of glaucoma” vs. “A photo of non-glaucoma”). Each prediction is determined by the highest similarity score between image and text features.
- Performance Metrics:
- AUC (Area under the ROC curve)
- ES-AUC (Equity-Scaled AUC)
- Fairness Metrics:
- DPD (Demographic Parity Difference)
- DEOdds (Difference in Equalized Odds)
- Group-wise AUC (for sensitive attribute stratifications)
- Sinkhorn Distance Monitoring: For each sensitive attribute, the Sinkhorn distance between subgroup and overall similarity score distributions is tracked to verify the effectiveness of the fairness regularizer.
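The group fairness metrics above can be computed from hard predictions and group labels roughly as follows (a sketch; DEOdds definitions vary across papers, and the max-gap variant shown here is one common choice, not necessarily the paper's exact formula):

```python
import numpy as np

def dpd(y_pred, groups):
    """Demographic parity difference: maximum gap in the
    positive-prediction rate across subgroups."""
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def deodds(y_true, y_pred, groups):
    """Equalized-odds difference: the larger of the maximum TPR gap
    and the maximum FPR gap across subgroups."""
    tprs, fprs = [], []
    for g in np.unique(groups):
        m = groups == g
        pos, neg = m & (y_true == 1), m & (y_true == 0)
        tprs.append(y_pred[pos].mean() if pos.any() else 0.0)
        fprs.append(y_pred[neg].mean() if neg.any() else 0.0)
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

Both metrics are zero when all subgroups are treated identically and grow toward one as subgroup treatment diverges.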
A-FairCLIP and FairCLIP+ are directly compared to standard fine-tuned CLIP (CLIP-FT) models to determine whether regularization yields measurable gains in group fairness or overall performance.
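The zero-shot decision rule described above amounts to a nearest-prompt argmax over cosine similarities. A minimal sketch, assuming precomputed features from hypothetical image/text encoders:

```python
import numpy as np

def zero_shot_predict(image_features, class_text_features):
    """Zero-shot classification: each image is assigned the class whose
    encoded text prompt (e.g. "A photo of glaucoma" vs. "A photo of
    non-glaucoma") has the highest cosine similarity to the image."""
    img = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    txt = class_text_features / np.linalg.norm(class_text_features, axis=1, keepdims=True)
    sims = img @ txt.T                  # (n_images, n_classes) similarity matrix
    return sims.argmax(axis=1)          # index of best-matching prompt per image
```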
4. Empirical Findings on Fairness and Utility
Contrary to expectations set by some earlier FairCLIP reports, A-FairCLIP does not consistently improve fairness metrics or classification performance relative to CLIP-FT. Major findings include:
- Sinkhorn Distance Reduction: As designed, the fairness regularizer in both A-FairCLIP and FairCLIP+ reliably minimizes the Sinkhorn distance between subgroup and population distributions.
- No Reliable Fairness Gain: Despite reduced distances, group fairness metrics (DPD, DEOdds) do not consistently improve. In some settings, gender-based fairness improves while racial group fairness worsens.
- Performance Variability: AUC and ES-AUC outcomes are sometimes less favorable for A-FairCLIP than for the baseline, and standard deviations are high, suggesting instability in the fairness-performance tradeoff.
- FairCLIP+ Multi-Attribute Extension: FairCLIP+ enforces regularization across multiple protected attributes simultaneously. Nevertheless, improvements observed in Sinkhorn distances do not translate into statistically significant gains in fairness or accuracy over CLIP-FT.
- Validation on Two Medical Datasets: Results do not support claims that the optimal transport regularizer enhances fairness and utility in zero-shot medical diagnosis, at least in the glaucoma task as implemented.
A plausible implication is that regularization strategies that strictly minimize similarity discrepancies (via Sinkhorn distance) may not be sufficient on their own to guarantee improved group fairness or diagnostic performance in practical downstream tasks.
5. Technical Table: Comparison of Methodological Choices
| Aspect | FairCLIP (Original Impl.) | A-FairCLIP (Aligned) | FairCLIP+ (Multi-Attr.) |
|---|---|---|---|
| Score calculation | Matrix product, sum-normalized | Direct diagonal dot products | Direct diagonal, multi-attribute |
| Validation protocol | Test set | Validation set | Validation set |
| Batch dependency | Present | Absent | Absent |
| Sinkhorn regularizer | Yes | Yes | Yes (weighted sum) |
| Fairness improvement | Inconclusive | Inconclusive | Inconclusive |
| Performance impact | Variable | Variable, high std. dev. | Variable |
The table clarifies that, while implementation details affect empirical behavior, none of the methods consistently outperform conventional fine-tuned CLIP with respect to both fairness and utility.
6. Interpretation and Outlook
The reproducibility analysis by the authors (Bakker et al., 8 Sep 2025) demonstrates that strictly following the theoretical FairCLIP regularization results in intended reductions in inter-group similarity divergence, but these reductions do not reliably produce fairer or more performant diagnostic models in medical zero-shot tasks. This suggests that minimizing a specific statistical distance (Sinkhorn or others) may be necessary but not sufficient for practical fairness improvements in vision-language settings. A plausible implication is that future fairness interventions should be developed with greater attention to task-level outcome metrics, subgroup sample sizes, and interplay between regularization and model selection criteria.
No evidence was found that optimal transport regularization, even when precisely implemented, robustly enhances fairness or downstream diagnostic accuracy; hence, claims regarding “favorable trade-offs” must be viewed with caution and reevaluated in the context of empirical reproducibility studies.