Social Counterfactuals Dataset

Updated 21 November 2025
  • Social counterfactuals datasets are structured collections that systematically vary social attributes to assess model bias using controlled counterfactual interventions.
  • They employ localized inpainting and attribute-conditional prompting in vision, and identity-token substitution in NLP, to alter only protected characteristics while preserving semantic context.
  • These datasets support bias profiling and mitigation by enabling evaluations with metrics such as FID, equality of opportunity, and counterfactual token fairness.

Social counterfactuals datasets are structured collections of data—often image-caption or text pairs—that systematically vary social attributes (such as gender, race, or group identity) in order to probe, diagnose, and mitigate algorithmic social bias in machine learning models. These datasets enable controlled counterfactual interventions on protected attributes while holding other semantic content constant, forming the empirical basis for quantifying and addressing model sensitivity to societal identity variables across computer vision, vision–language, and NLP tasks.

1. Foundational Concepts and Motivations

The primary motivation for social counterfactuals datasets derives from the observation that foundation models, trained on large-scale web-scraped or organic corpora, encode and propagate societal biases through spurious correlations between social group attributes and semantic roles or labels. Social counterfactual evaluation, in which one compares model outputs on examples that are identical except for group-relevant tokens or pixels, is a canonical technique for quantifying and mitigating such biases.

In computer vision, counterfactual image datasets enable localized interventions (e.g., changing perceived gender while leaving scene context unchanged), facilitating both accurate bias profiling and targeted data augmentation for debiasing (Sirotkin et al., 12 Dec 2024, Howard et al., 2023). In text classification, systematically swapping identity tokens and filtering for sentence plausibility allows controlled measurement of “counterfactual token fairness” and the impact of social-group references on classifier behavior (Davani et al., 2020).
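As a concrete illustration of this pairwise evaluation for text classifiers, the sketch below swaps identity tokens and records the resulting change in model output. The scoring function `score_toxicity`, the token list, and the example sentence are placeholders for illustration, not artifacts of the cited datasets.

```python
# Minimal sketch of counterfactual evaluation for a text classifier.
# `score_toxicity` is a placeholder for any function returning a probability;
# the identity tokens and example sentence are illustrative only.

IDENTITY_TOKENS = ["women", "men", "muslims", "christians"]

def counterfactual_gaps(sentence: str, token: str, score_toxicity) -> dict:
    """For a sentence containing `token`, swap in each alternative identity
    token and record the absolute change in classifier output, i.e. Δ(f; x, x')."""
    base = score_toxicity(sentence)
    gaps = {}
    for alt in IDENTITY_TOKENS:
        if alt == token:
            continue
        gaps[alt] = abs(score_toxicity(sentence.replace(token, alt)) - base)
    return gaps

# Usage with a deliberately biased dummy scorer: the gap is nonzero only
# for the substitution the scorer reacts to.
print(counterfactual_gaps("I admire women in science.", "women",
                          lambda s: 0.1 + 0.5 * ("muslims" in s)))
```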

2. Representative Datasets and Their Construction

Social counterfactuals datasets span both vision-language and NLP domains, distinguished by their methods for generating counterfactual variants and controlling for semantic invariance.

Social Counterfactuals in Vision (Pinpoint Counterfactuals, SocialCounterfactuals)

  • Pinpoint Counterfactuals (Sirotkin et al., 12 Dec 2024) is built on the Conceptual Captions (CC3M) dataset, targeting images containing people (606,041 instances out of 2.13M) and generating two localized counterfactuals (“man” and “woman”) for each, for a total of 1,818,123 person-containing samples (originals plus counterfactuals). Counterfactual generation involves (a) automated person and skin segmentation (ResNet-101), (b) BrushNet inpainting driven by masked, attribute-specific textual prompts, and (c) keyword-level caption editing using bijective mappings for gendered terms (a minimal sketch of this caption-editing step follows this list).
  • SocialCounterfactuals (Howard et al., 2023) generalizes counterfactual generation for probing intersectional biases in vision–language models (VLMs). Images and captions are synthesized using Stable Diffusion with cross-attention control, targeting intersections of race (6 classes), gender (2), and body traits (5) across 158 occupations. Prompts are crafted to isolate attribute variation, and cross-attention masking ensures image generation holds all non-attribute semantics constant. The pipeline over-generates and filters for image–caption alignment, mutual image similarity, attribute detectability using CLIP, and non-NSFW content, yielding 170,832 high-quality image–text pairs grouped into 13,824 counterfactual sets.
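As an illustration of the keyword-level caption-editing step used by such pipelines, the sketch below applies a small, hypothetical bijective mapping of gendered terms; the curated term lists in the actual datasets are substantially larger.

```python
import re

# Hypothetical bijective mapping of gendered keywords; real pipelines use
# larger curated lists. The reverse direction is added automatically.
GENDER_MAP = {"man": "woman", "men": "women", "he": "she",
              "his": "her", "father": "mother", "boy": "girl"}
GENDER_MAP.update({v: k for k, v in list(GENDER_MAP.items())})

def swap_gendered_terms(caption: str) -> str:
    """Rewrite a caption by swapping each gendered keyword with its
    counterpart while leaving the rest of the sentence untouched."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = GENDER_MAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(GENDER_MAP) + r")\b"
    return re.sub(pattern, repl, caption, flags=re.IGNORECASE)

print(swap_gendered_terms("A man walks his dog."))  # -> "A woman walks her dog."
```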

Social Counterfactuals in NLP (Fair Hate Speech Detection)

  • Textual Social Counterfactuals (Davani et al., 2020) are created by enumerating all possible single-token substitutions for a set of 77 social-group tokens in hate-speech corpora (e.g., Gab Hate Corpus, Stormfront). For each sentence with exactly one social-group token, all possible substitutions yield up to 76 counterfactuals. A pretrained GPT-2 language model scores both original and substituted sentences; only those substitutions that do not reduce sentence plausibility (measured as total log-likelihood) are retained (“symmetric” counterfactuals). On average, 10–15% of candidates pass the plausibility filter.
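A minimal sketch of the GPT-2 plausibility filter, assuming the Hugging Face `gpt2` checkpoint; the exact model variant, likelihood computation, and acceptance rule in the paper may differ in detail.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_log_likelihood(sentence: str) -> float:
    """Approximate total log-likelihood of a sentence under GPT-2
    (mean next-token cross-entropy times token count, negated)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
    return -loss.item() * ids.shape[1]

def keep_counterfactual(original: str, substituted: str) -> bool:
    """Retain a substitution only if it does not reduce plausibility."""
    return sentence_log_likelihood(substituted) >= sentence_log_likelihood(original)
```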

3. Methodological Frameworks and Filtering

  • Localized Masking and Inpainting: In vision datasets (Sirotkin et al., 12 Dec 2024), attribute edits are restricted to semantically relevant image regions via person and skin segmentation followed by guided diffusion-based inpainting. This contrasts with global synthetic re-generation which can arbitrarily alter scene context, leading to artifacts.
  • Attribute-Conditional Prompting and Caption Editing: Prompts expressly encode only the intended social attribute change, and captions are automatically edited by mapping group-relevant terms across attributes. Bijective lists for gendered tokens ensure systematic coverage (Sirotkin et al., 12 Dec 2024, Howard et al., 2023).
  • Over-Generation and High-Precision Filtering: Both text (Davani et al., 2020) and image (Howard et al., 2023) datasets implement multi-stage filtering. In vision, CLIP-based metrics enforce semantic similarity and attribute detectability; NSFW classifiers and manual checks further refine the dataset. In NLP, GPT-2 scoring eliminates substitutions that decrease syntactic or semantic plausibility.
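A minimal sketch of the CLIP-based image–caption similarity check used in this filtering stage; the checkpoint and acceptance threshold below are illustrative rather than the values used in the cited pipelines.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    return torch.cosine_similarity(img, txt).item()

def passes_filter(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    # Keep a generated counterfactual only if image and caption agree.
    return clip_similarity(image, caption) >= threshold
```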
| Dataset/Domain | Attribute Axes | Counterfactual Generation |
|---|---|---|
| Pinpoint (Sirotkin et al., 12 Dec 2024) | Gender (man/woman) | Masked BrushNet inpainting, caption mapping |
| SocialCF (Howard et al., 2023) | Race, gender, physique | Stable Diffusion (cross-attention mask), prompt engineering, CLIP filter |
| FairHS (Davani et al., 2020) | 77 social tokens | String substitution, GPT-2 likelihood filter |

4. Evaluation Metrics and Empirical Profiles

Fidelity and bias mitigation performance in social counterfactuals datasets are assessed using an array of quantitative and diagnostic metrics:

  • Fidelity Metrics: For vision, Human Preference Score (HPS), Aesthetic Score (AS), Image Reward (IR), Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Conditional MMD (CMMD) compare the realism and distributional alignment of counterfactuals versus real samples (Sirotkin et al., 12 Dec 2024). Table 1 in (Sirotkin et al., 12 Dec 2024) indicates that localized inpainting yields the lowest FID/KID and only slight reductions in aesthetic scores relative to real images, outperforming prior fully synthetic baselines.
  • Bias Metrics in Vision-Language Models:
    • Person Preference (PP): Quantifies the model’s tendency to attend to “person” versus specific gender semantics.
    • Self-Similarity (“Markedness”): Measures intra-group representation similarity.
    • Equality of Opportunity (Δ_EoO): Reports true positive rate disparities on occupation-specific gender splits.
    • Gender Recall Disparity (Δ_recall): Quantifies differences in recall for different gender representations (Sirotkin et al., 12 Dec 2024).
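One common way to operationalize the self-similarity ("markedness") metric is as the mean pairwise cosine similarity among a group's image embeddings; the sketch below assumes precomputed embeddings and is not necessarily the exact computation from the cited paper.

```python
import numpy as np

def self_similarity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity within one social group.

    embeddings: (N, D) array of feature vectors (e.g., from a VLM image
    encoder) for samples of a single group; higher values indicate more
    homogeneous, more "marked" representations.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                     # (N, N) cosine similarities
    off_diag = sims[~np.eye(len(embeddings), dtype=bool)]  # drop self-pairs
    return float(off_diag.mean())
```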

In (Howard et al., 2023), MaxSkew@K measures intersectional retrieval skew across attribute combinations. On real-world datasets (VisoGender, PATA), fine-tuning with counterfactual data reduces MaxSkew by 12–15%.
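MaxSkew@K follows the usual skew formulation for ranked retrieval: for each attribute value, compare its share of the top-K retrieved items against a desired share and take the worst case. The sketch below assumes a uniform desired distribution over attribute labels; the paper's exact desired distribution and smoothing may differ.

```python
import math
from collections import Counter

def max_skew_at_k(ranked_attributes: list, attribute_labels: set, k: int) -> float:
    """MaxSkew@K over a ranked list of per-item attribute labels,
    assuming a uniform desired distribution (illustrative choice)."""
    counts = Counter(ranked_attributes[:k])
    desired = 1.0 / len(attribute_labels)
    skews = []
    for attr in attribute_labels:
        observed = max(counts.get(attr, 0) / k, 1e-6)  # smooth zero counts
        skews.append(math.log(observed / desired))
    return max(skews)

# Example: a retrieval run returning only one race label in the top 10
# yields a large positive MaxSkew.
print(max_skew_at_k(["white"] * 10, {"white", "black", "asian"}, k=10))
```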

  • NLP Fairness and Use Metrics:
    • Counterfactual Token Fairness (CTF): Binary indicator of classifier output invariance across counterfactual pairs (Davani et al., 2020).
    • Δ(f; x, x′): Absolute difference in classifier output between a sentence and its counterfactual.
    • Equality of Odds: Standard deviation of TPR/TNR across social groups in hate speech detection.
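The pairwise quantities CTF and Δ(f; x, x′) follow directly from the gap computation sketched in Section 1; equality of odds can then be read as the spread of group-wise error rates. A minimal sketch, interpreting it as the standard deviation of TPR and TNR across groups (array names are illustrative):

```python
import numpy as np

def equality_of_odds_std(y_true: np.ndarray, y_pred: np.ndarray,
                         groups: np.ndarray) -> tuple:
    """Standard deviation of group-wise TPR and TNR for a binary task."""
    tprs, tnrs = [], []
    for g in np.unique(groups):
        member = groups == g
        pos = member & (y_true == 1)
        neg = member & (y_true == 0)
        if pos.any():
            tprs.append(float((y_pred[pos] == 1).mean()))
        if neg.any():
            tnrs.append(float((y_pred[neg] == 0).mean()))
    return float(np.std(tprs)), float(np.std(tnrs))
```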

5. Applications in Bias Profiling and Mitigation

Social counterfactuals enable rigorous profiling of model biases and serve as direct or augmentation sources for downstream debiasing:

  • Vision-Language Fine-Tuning: Replacing or augmenting training data with systematically generated, balanced counterfactuals (e.g., up to 100% replacement for person images) allows control over demographic representation without degrading general zero-shot performance. Fine-tuning on hybrid splits yields optimal reduction in group and intersectional disparities, e.g., reducing Δ_self-sim and Δ_EoO by 30–60% in the reported experiments (Sirotkin et al., 12 Dec 2024, Howard et al., 2023).
  • Hate Speech Detection: Regularizing classifiers on symmetric counterfactuals via loss pairing (a penalty on the logit distance between x and x′) increases output invariance to group tokens without degrading primary classification accuracy (Davani et al., 2020); a minimal sketch of this pairing loss follows this list.
  • Benchmarking and Diagnostic Use: These datasets are used to expose model shortcuts and biases in both research and evaluation contexts. Counterfactual-based probes constitute best practice for intersectional audits where naturally occurring data is insufficiently balanced (Howard et al., 2023).
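A minimal sketch of the counterfactual logit-pairing regularizer referenced above, written in PyTorch; `model`, the batch layout, and the weight `lambda_cf` are illustrative assumptions rather than the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def counterfactual_pairing_loss(model, batch, lambda_cf: float = 1.0) -> torch.Tensor:
    """Classification loss plus a penalty on the logit distance between each
    sentence and its counterfactual, encouraging invariance to group tokens."""
    logits_orig = model(batch["original_ids"])         # (B, num_classes)
    logits_cf = model(batch["counterfactual_ids"])     # (B, num_classes)
    task_loss = F.cross_entropy(logits_orig, batch["labels"])
    pairing_penalty = (logits_orig - logits_cf).abs().mean()
    return task_loss + lambda_cf * pairing_penalty
```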

6. Limitations, Challenges, and Prospects

Current social counterfactuals datasets exhibit several structural and methodological constraints:

  • Attribute Coverage: Most datasets address a limited set of protected attributes—typically binary gender, US-centric races, or discrete occupation sets. Coverage of religion, age, non-binary gender, and global taxonomies is limited or absent (Howard et al., 2023, Sirotkin et al., 12 Dec 2024).
  • Synthetic Artifacts and Limitations: Diffusion models occasionally fail to faithfully depict minority combinations or complex intersectional identities; inpainting methods may leave subtle artifacts; text substitutions may not always yield grammatically coherent counterfactuals (Davani et al., 2020, Sirotkin et al., 12 Dec 2024, Howard et al., 2023).
  • Filtering Reliance on Pretrained Models: Filtering using GPT-2 or CLIP encoders introduces dependence on the preexisting biases of those models, which may affect counterfactual selection (Davani et al., 2020, Howard et al., 2023).
  • Representation and Generalization: Textual social counterfactuals struggle with multi-word or polysemous tokens, morphological agreement, and are specific to underlying corpora (e.g., Gab, CC3M) (Davani et al., 2020).

Future directions outlined in (Howard et al., 2023) and (Sirotkin et al., 12 Dec 2024) include extension to additional social attributes (e.g., pronouns, broader age groups), the inclusion of more nuanced or non-binary group representations, global taxonomies, and human-in-the-loop semantic validation pipelines for increased annotation reliability.

7. Accessibility and Usage Guidance

Social counterfactuals datasets are disseminated under open-access licenses and are co-published with code for reproducing the counterfactual generation and evaluation pipelines.

Usage recommendations emphasize coupling real and counterfactual splits for diagnosis, combining aesthetic and distributional metrics for evaluation, and extending the generation pipeline to new domains and group attributes as needed for broader fairness auditing.


References:

“Pinpoint Counterfactuals: Reducing social bias in foundation models via localized counterfactual generation” (Sirotkin et al., 12 Dec 2024).
“SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples” (Howard et al., 2023).
“Fair Hate Speech Detection through Evaluation of Social Group Counterfactuals” (Davani et al., 2020).
