FairFace Dataset: Balanced Face Attributes
- FairFace is a balanced face dataset featuring detailed race, gender, and age annotations to mitigate racial imbalances in earlier corpora.
- It employs adaptive sampling and consensus-based crowdsourced labeling, using tools like dlib and ResNet-34 to ensure high-quality annotations.
- Empirical evaluations show that models trained on FairFace achieve higher overall and subgroup accuracies with reduced bias compared to legacy datasets.
FairFace is a publicly available, large-scale face attribute dataset designed to mitigate the systematic racial imbalances present in earlier open face-image corpora, with a particular emphasis on balanced representation across race, gender, and age. It is widely adopted for benchmarking fairness, robustness, and bias mitigation strategies in face-analytic algorithms. The dataset's curation protocols, intersectional composition, annotation methodologies, and empirical outcomes are documented by Kärkkäinen and Joo (Kärkkäinen et al., 2019) and audited in subsequent fairness research (Bahiru et al., 17 Oct 2025; Dong et al., 11 Oct 2025).
1. Dataset Composition and Curation
FairFace comprises 108,501 real-world face images sampled primarily from Yahoo's YFCC-100M Flickr collection, all released under permissive Creative Commons licenses. The dataset was engineered to ensure balanced subgroup representation, explicitly moving beyond the Caucasian/White over-representation typical of prior sets. The race taxonomy spans seven classes (White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino), and underrepresented groups were supplemented where necessary from other sources such as Twitter and newspaper images.
Faces were detected using dlib's max-margin (MMOD) CNN detector with a minimum bounding box of 50×50 pixels. Adaptive sampling was deployed throughout corpus construction: once over-represented race groups (notably White faces) reached their quota, images from the corresponding countries were downsampled and enrichment focused on underrepresented regions (South Asia, the Middle East, Southeast Asia, and Latin America).
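The detection-and-filtering step can be illustrated with a short sketch. The snippet below is a minimal example assuming dlib's pretrained CNN (MMOD) face detector; the weights path, image path, and helper name are illustrative and not part of the FairFace release.

```python
# Minimal sketch of CNN face detection with a 50x50-pixel minimum box,
# assuming dlib's pretrained MMOD detector; file paths are placeholders.
import dlib

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def detect_faces(image_path, min_size=50):
    """Return detected face rectangles at least min_size x min_size pixels."""
    img = dlib.load_rgb_image(image_path)
    detections = detector(img, 1)  # upsample once to recover smaller faces
    kept = []
    for d in detections:
        r = d.rect  # each MMOD detection carries a rectangle and a confidence
        if r.width() >= min_size and r.height() >= min_size:
            kept.append(r)
    return kept
```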
Annotations include three axes:
- Race: Seven classes (White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, Latino).
- Gender: Binary (“male,” “female”).
- Age: Four bins ([0–9], [10–29], [30–49], [50+]), with finer decade-wide bins used in subsequent fairness audits (Bahiru et al., 17 Oct 2025).
The intersectional structure defines a grid of "cells," each representing a unique race × gender × age combination; fairness audits apply configurations such as 2 genders × 9 races × 10 age bins = 180 cells.
| Category | # Images | % of Total |
|---|---|---|
| White | 15,600 | 14.4% |
| Black | 15,500 | 14.3% |
| Indian | 15,550 | 14.3% |
| East Asian | 15,500 | 14.3% |
| Southeast Asian | 15,500 | 14.3% |
| Middle Eastern | 15,550 | 14.3% |
| Latino | 15,801 | 14.6% |
| Male | 54,300 | 50.1% |
| Female | 54,201 | 49.9% |
| Age 0–9 | 24,300 | 22.4% |
| Age 10–29 | 30,000 | 27.7% |
| Age 30–49 | 29,500 | 27.2% |
| Age 50+ | 24,701 | 22.7% |
Approximately uniform representation is achieved within each major category.
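The per-category shares in the table can be reproduced directly from the released label files. The following is a small sketch, assuming a FairFace-style CSV with race, gender, and age columns; the file name is a placeholder.

```python
# Sketch: tabulate the marginal distribution of each annotation axis,
# assuming columns named 'race', 'gender', and 'age' in the label CSV.
import pandas as pd

labels = pd.read_csv("fairface_label_train.csv")  # placeholder file name

for axis in ["race", "gender", "age"]:
    share = labels[axis].value_counts(normalize=True).sort_index()
    print(f"{axis} distribution (% of total):")
    print((share * 100).round(1), "\n")
```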
2. Annotation Protocol and Quality Control
Annotation for race, gender, and age followed a consensus-based crowdsourced protocol. Each face was labeled by three independent Amazon Mechanical Turk raters; a label supported by at least two raters was accepted, while persistent disagreement triggered further review or image removal. To refine labels and flag outliers, a preliminary ResNet-34 classifier was trained on the initial annotations, surfacing images with high classification uncertainty or inter-rater discrepancy, which then received expert relabeling.
Fewer than 2% of images were removed because of irreconcilable annotation disagreement. Although inter-annotator agreement statistics were not reported directly, the majority-vote scheme and subsequent machine-assisted correction reduced visible label noise, suggesting that annotation reliability was controlled within practical bounds for large-scale training and evaluation.
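The 2-of-3 consensus rule can be expressed compactly; the sketch below is illustrative only and does not reproduce the actual annotation tooling.

```python
# Sketch of the 2-of-3 majority-vote rule: accept a label when at least two
# raters agree, otherwise flag the image for expert review or removal.
from collections import Counter

def aggregate_votes(rater_labels):
    label, votes = Counter(rater_labels).most_common(1)[0]
    if votes >= 2:
        return label, "accepted"
    return None, "flagged_for_review"

print(aggregate_votes(["East Asian", "East Asian", "Southeast Asian"]))  # accepted
print(aggregate_votes(["White", "Middle Eastern", "Latino"]))            # flagged
```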
3. Stratification, Partitioning, and Dataset Usage
FairFace is released as a unified set without mandatory train/val/test splits; users are advised to stratify partitions along race, gender, and age axes to preserve class proportions. The recommended procedure is to allocate samples within each demographic bucket to training (≈80%), validation (≈10%), and testing (≈10%).
Random stratified sampling is encouraged, especially for fairness benchmarking and external validation of face-attribute models. A plausible implication is that downstream model fairness is maximized when partitioning strictly enforces subgroup representation even at the split level, not just in the global dataset.
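One way to realize the recommended 80/10/10 stratified split is sketched below, assuming a label DataFrame with race, gender, and age columns; the column names, file name, and use of scikit-learn are assumptions rather than an official recipe.

```python
# Sketch of an 80/10/10 split stratified on the race x gender x age cell,
# assuming 'race', 'gender', and 'age' columns in the label CSV.
import pandas as pd
from sklearn.model_selection import train_test_split

labels = pd.read_csv("fairface_label_train.csv")  # placeholder file name
strata = (labels["race"].astype(str) + "_"
          + labels["gender"].astype(str) + "_"
          + labels["age"].astype(str))

# 80% train, then split the remaining 20% evenly into validation and test,
# stratifying both times so each cell keeps its proportional share.
train_df, hold_df = train_test_split(labels, test_size=0.2,
                                     stratify=strata, random_state=0)
val_df, test_df = train_test_split(hold_df, test_size=0.5,
                                   stratify=strata.loc[hold_df.index],
                                   random_state=0)
print(len(train_df), len(val_df), len(test_df))
```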
4. Model Training, Evaluation, and Fairness Metrics
Canonical FairFace modeling uses a ResNet-34 backbone trained with the Adam optimizer. Standard cross-entropy loss is adopted for race, gender, and age classification: $\mathcal{L} = -\sum_{c} y_c \log \hat{y}_c$, where $y$ is the one-hot ground-truth vector and $\hat{y}$ is the vector of predicted class probabilities.
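A minimal training sketch consistent with this setup is shown below (ResNet-34, Adam, cross-entropy); the learning rate, number of classes, and data loader are placeholders, not values taken from the original paper.

```python
# Sketch: ResNet-34 classifier trained with Adam and cross-entropy loss.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 7                        # e.g., the seven FairFace race categories
model = models.resnet34(weights=None)  # backbone; pretrained weights optional
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumed placeholder
criterion = nn.CrossEntropyLoss()      # computes -sum_c y_c log(y_hat_c)

def train_one_epoch(loader):
    model.train()
    for images, targets in loader:     # loader yields image tensors and integer class ids
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```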
Evaluation metrics include:
- Overall accuracy: Fraction of correctly classified samples.
- Per-group accuracy: Accuracy within each demographic group.
- Balanced accuracy over groups: $\mathrm{Acc}_{\mathrm{bal}} = \frac{1}{G} \sum_{g=1}^{G} \mathrm{Acc}_g$, the unweighted mean of the per-group accuracies.
- Maximum accuracy disparity ($\varepsilon$): $\varepsilon = \max_g \mathrm{Acc}_g - \min_g \mathrm{Acc}_g$, where $g$ indexes demographic groups (a computation sketch follows this list).
- Conditional use accuracy equality, requiring predictions to be equally reliable across groups: $P(Y = y \mid \hat{Y} = y, A = a) = P(Y = y \mid \hat{Y} = y, A = b)$ for every predicted class $y$ and all groups $a, b$.
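A small sketch of the per-group accuracy, balanced accuracy, and maximum disparity computations follows; variable names are illustrative.

```python
# Sketch: per-group accuracy, balanced accuracy, and maximum disparity epsilon.
import numpy as np

def group_accuracies(y_true, y_pred, groups):
    return {g: float((y_pred[groups == g] == y_true[groups == g]).mean())
            for g in np.unique(groups)}

def balanced_accuracy(accs):
    return sum(accs.values()) / len(accs)            # (1/G) * sum_g Acc_g

def max_disparity(accs):
    return max(accs.values()) - min(accs.values())   # epsilon

y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
groups = np.array(["A", "A", "A", "B", "B", "B"])
accs = group_accuracies(y_true, y_pred, groups)
print(accs, balanced_accuracy(accs), max_disparity(accs))
```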
5. Empirical Fairness Outcomes and Benchmark Results
Models trained on FairFace consistently outperform those trained on legacy sets—both in mean accuracy and subgroup variance—across external validation cohorts. In held-out cross-domain tests (Twitter, newspapers, protest images):
- Race classification accuracy: FairFace (full) 81.5%, UTKFace 68.4%, LFWA+ 68.4%
- Gender classification accuracy: FairFace 95.7%, UTKFace 90.4%, LFWA+ 83.8%, CelebA 87.0%
- Age accuracy: FairFace 53.6%, UTKFace 31.4%
FairFace models show per-group gender accuracy standard deviation of 3.0% (vs. 5.6% for LFWA+). Gender classification accuracy gap between male/female and White/non-White is consistently under 1% for FairFace, compared to gaps as large as 32% for smaller, skewed datasets.
| Model | White M | White F | Black M | Black F | ε |
|---|---|---|---|---|---|
| FairFace | 0.967 | 0.954 | 0.958 | 0.917 | 0.055 |
| UTKFace | 0.926 | 0.864 | 0.909 | 0.795 | 0.127 |
| LFWA+ | 0.946 | 0.680 | 0.974 | 0.432 | 0.359 |
| CelebA | 0.829 | 0.958 | 0.819 | 0.919 | 0.166 |
Even when downsampled to the same number of training images (e.g., 9k or 18k), FairFace outperforms other datasets, indicating that diversity, rather than scale, is the primary driver of robustness across demographic axes.
6. Bias Assessment, Intersectional Coverage, and Data-Centric Audits
Subsequent research employing intersectional audits (Bahiru et al., 17 Oct 2025) formalizes two quantitative notions: Inclusivity ($I$), the fraction of all possible subgroup cells with at least some representation, and Diversity ($D$), the ratio of the smallest to the largest intersectional cell share. The audit reveals some missing cells (e.g., age bins 0–2 and 71–100 for specific races/genders); as a result, neither $I$ nor $D$ reaches its ideal value of 1 for FairFace, despite its substantially improved balance compared to predecessors.
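Under the definitions above, Inclusivity and Diversity can be estimated from the label file as follows; the column names and the handling of empty cells are assumptions of this sketch, not the audit's exact protocol.

```python
# Sketch: Inclusivity I (fraction of non-empty race x gender x age cells) and
# Diversity D (smallest non-empty cell share divided by the largest cell share).
import pandas as pd

labels = pd.read_csv("fairface_label_train.csv")          # placeholder file name
cells = labels.groupby(["race", "gender", "age"]).size()  # empty cells are dropped

n_possible = (labels["race"].nunique() * labels["gender"].nunique()
              * labels["age"].nunique())
inclusivity = len(cells) / n_possible

shares = cells / cells.sum()
diversity = shares.min() / shares.max()                   # 1.0 only under perfect balance

print(f"I = {inclusivity:.3f}, D = {diversity:.3f}")
```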
Downstream gender classifiers trained on FairFace show group-dependent True Positive Rates (TPR) and Disparate Impact (DI):
- TPR gap across races: max 0.195
- DI spans 0.887 (Black) to 1.066 (East Asian)
Females, especially in underrepresented racial-age cells, are misclassified at higher rates than males (e.g., Black females TPR = 0.650 vs. White females TPR = 0.792). Some subgroup cells contained fewer than a dozen images, negatively affecting generalization. This suggests that even deliberate balancing does not eradicate biases linked to underpopulated demographic intersections.
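The group-wise TPR and Disparate Impact figures above can be reproduced with a short routine of this form; the positive class, group labels, and reference group are illustrative assumptions.

```python
# Sketch: per-group True Positive Rate and Disparate Impact for a binary classifier.
import numpy as np

def tpr_by_group(y_true, y_pred, groups, positive=1):
    out = {}
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == positive)
        out[g] = float((y_pred[mask] == positive).mean()) if mask.any() else float("nan")
    return out

def disparate_impact(y_pred, groups, reference, positive=1):
    """Selection rate of each group divided by the reference group's rate."""
    rates = {g: float((y_pred[groups == g] == positive).mean()) for g in np.unique(groups)}
    return {g: r / rates[reference] for g, r in rates.items()}

y_true = np.array([1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
groups = np.array(["White", "White", "White", "Black", "Black", "Black"])
print(tpr_by_group(y_true, y_pred, groups))
print(disparate_impact(y_pred, groups, reference="White"))
```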
7. Extensions, Limitations, and Recommended Usage
Limitations include the lack of non-binary gender annotation, exclusion of Pacific Islanders and Native Americans, and coarse age binning. Annotation is subject to residual subjective noise despite consensus procedures.
FairFace is recommended for:
- Stratified fairness benchmarking of face-attribute models
- Auditing commercial systems for demographic consistency
- Evaluating and developing new fairness interventions, e.g., adversarial methods, reweighting, and semi-supervised debiasing (Dong et al., 11 Oct 2025)
Stratifying train/val/test splits by subgroup remains essential for preserving fairness properties when employing the dataset. Prospects for extension include non-binary, skin-tone, and finer age-group labels.
In summary, FairFace exemplifies a paradigm shift in dataset curation—favoring explicit demographic balancing, careful annotation, and model-assisted quality control—yielding demonstrable improvements in the equity and reliability of face-attribute classification systems relative to legacy benchmarks.