Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks

Published 21 Nov 2025 in cs.CV and cs.AI (arXiv:2511.17393v1)

Abstract: Face verification is a significant component of identity authentication in various applications, including online banking and secure access to personal devices. The majority of existing face image datasets suffer from notable biases related to race, gender, and other demographic characteristics, limiting the effectiveness and fairness of face verification systems. In response to these challenges, we propose a comprehensive methodology that integrates advanced generative models to create diverse, high-quality synthetic face images. This methodology emphasizes the representation of a wide range of facial traits, ensuring adherence to characteristics permissible in identity card photographs. Furthermore, we introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities, designed as a benchmark for future research in face verification. Our analysis reveals that existing verification models exhibit biases toward certain genders and races, and notably, applying identity style modifications negatively impacts model performance. By tackling the inherent inequities in existing datasets, this work not only enriches the discussion on diversity and ethics in artificial intelligence but also lays the foundation for developing more inclusive and reliable face verification technologies.

Summary

  • The paper presents a novel multi-model pipeline for generating a demographically balanced synthetic face dataset tailored for verification tasks.
  • It quantifies model bias by revealing lower performance for female and East Asian faces, using metrics like TAR and EER.
  • The study highlights that realistic ID card style adaptations further degrade verification accuracy and emphasizes ethical concerns.


Motivation and Background

The efficacy and fairness of face verification systems in practical applications such as online banking, device authentication, and document verification are fundamentally constrained by bias in existing datasets. Legacy datasets, including LFW, RFW, and CFP-FP, are characterized by overrepresentation of certain demographics (notably white male celebrities) and lack intersectional diversity. This demographic imbalance, combined with unlicensed image sourcing, introduces significant bias and ethical risk, impeding the broad deployment of equitable face verification technologies. Recent advances in synthetic face generation via text-to-image (T2I) models and diffusion-based methods offer new possibilities for producing demographically balanced synthetic datasets, potentially redressing these limitations.

Methodology Overview

The work proposes a robust pipeline for generating synthetic face images that are diverse in race, gender, permanent traits, and other facial characteristics relevant to identity card photo constraints. Four diffusion-based T2I models (Stable Diffusion 3.5, CosmicMan, Kandinsky, and FLUX.1) are employed with carefully engineered prompts combining a comprehensive set of facial attributes $\mathcal{A}$. Prompt engineering is critical, as prompt adherence for less common features (e.g. albinism, vitiligo, Down syndrome) remains challenging for existing generative models (Figure 1).

Figure 1: Prompt adherence issues in T2I models for synthetically generating faces with rare attributes such as albinism, Down syndrome, braces, and vitiligo.

Universal prompt refinements enforce realistic skin details and background uniformity, and negative prompts are used to further suppress undesired traits. Images generated are filtered by automated quality scores and by manual inspection to eliminate failures such as multi-face scenes. The process is repeated to maximize diversity and realism, typically requiring around 20 generations per image.
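The attribute-combination, universal-refinement, and negative-prompt strategy described above can be sketched in a few lines. The attribute lists, refinement strings, and function names below are illustrative stand-ins, not the paper's actual prompt templates or full attribute set $\mathcal{A}$:

```python
import itertools

# Illustrative subset of the attribute set; the paper's full set is larger.
ATTRIBUTES = {
    "race": ["East Asian", "Southeast Asian", "Indian", "White",
             "Black", "Middle Eastern", "Latino", "Pacific Islander"],
    "gender": ["male", "female", "non-binary"],
    "trait": ["freckles", "vitiligo", "albinism", "braces", "none"],
}

# Universal refinements enforce realistic skin and a uniform background;
# the negative prompt suppresses failure modes such as multi-face scenes.
REFINEMENTS = "realistic skin texture, plain uniform background, frontal pose"
NEGATIVE = "multiple faces, cartoon, text, watermark"

def build_prompt(race, gender, trait):
    """Compose one T2I prompt (and its negative prompt) from attributes."""
    trait_part = "" if trait == "none" else f" with {trait},"
    return f"ID photo of a {race} {gender} adult,{trait_part} {REFINEMENTS}", NEGATIVE

# Enumerating every combination gives balanced demographic coverage.
prompts = [build_prompt(*combo)
           for combo in itertools.product(*ATTRIBUTES.values())]
```

In practice each prompt pair would feed a diffusion pipeline's positive and negative prompt inputs, with regeneration (around 20 attempts per image, per the text) until the automated quality filter and manual inspection pass.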

Example outputs from this multi-model generative pipeline demonstrate the breadth of feature and demographic diversity achievable via this approach (Figure 2).

Figure 2: Synthetic faces generated for a representative set of prompt combinations across racial, gender, and trait diversity, sourced from four different T2I models.

Identity Variation and ID Card Styling

For each unique identity, multiple variations are created using the Arc2Face and PuLID generative models, combined recursively for extended diversity. To simulate ID card scenarios, generated images undergo background replacement (using HivisionIDPhotos) and style transfer (using PAMA with 23 nation-specific ID template images). This filter chain modifies the facial renders to more closely emulate real-world ID photo characteristics (Figure 3), enhancing the benchmark's relevance for document-style verification settings.

Figure 3: Schematic illustration of style extraction and filtering to match the ID card format for synthetic face images.
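The background-replacement step can be illustrated with a toy mask-based compositing function. This is a minimal sketch of the idea only, not the HivisionIDPhotos or PAMA implementation; the image and mask representation is simplified to nested lists of RGB tuples:

```python
def replace_background(image, mask, bg_color):
    """Replace non-face pixels (mask value 0) with a flat ID-card background.

    image: 2D grid of RGB tuples; mask: 2D grid of 0/1 (1 = keep foreground).
    """
    return [
        [px if keep else bg_color for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

# Toy 2x2 example: top row is "face", bottom row is background.
image = [[(200, 160, 140), (195, 158, 138)],
         [(30, 30, 30), (40, 40, 40)]]
mask = [[1, 1],
        [0, 0]]
styled = replace_background(image, mask, (255, 255, 255))
```

A real pipeline would additionally run style transfer against a nation-specific ID template, as the paper does with PAMA.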

DIF-V Dataset Composition

The outcome is the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities. Per identity, 30 image variants (covering diverse traits and styles) are produced. Gender and race distributions are tightly controlled for balance: approximately 300 identities per gender and 115 identities per race, across eight racial categories, including representation for Pacific Islanders distinct from other Asian groups.
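The reported composition figures are internally consistent, as a quick arithmetic check shows (assuming three gender categories, as used in the bias analysis below):

```python
# Sanity-check DIF-V's reported composition.
identities = 926
variants_per_identity = 30
total_images = identities * variants_per_identity  # matches the reported 27,780

races = 8
identities_per_race = identities / races      # ~115.75, i.e. ~115 per race

genders = 3
identities_per_gender = identities / genders  # ~308.7, i.e. roughly 300 per gender
```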

Empirical Benchmarking and Bias Analysis

Extensive evaluation of state-of-the-art face verification models (ArcFace, Facenet) on DIF-V reveals persistent bias in verification performance. Accuracy, True Acceptance Rate (TAR) at 1% False Acceptance Rate (FAR), Equal Error Rate (EER), and AUC are computed for subgroups by gender and race.
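The two threshold-based metrics can be sketched as follows. This is a pure-Python illustration of the standard definitions (TAR at a fixed FAR, and EER as the operating point where FAR equals FRR), not the paper's evaluation code; a real evaluation would vectorize over large score arrays:

```python
def tar_at_far(genuine, impostor, far_target=0.01):
    """True Acceptance Rate at the threshold giving the target FAR."""
    # Pick the threshold that the top far_target fraction of impostor
    # similarity scores would still exceed.
    imp_sorted = sorted(impostor, reverse=True)
    k = max(1, int(far_target * len(imp_sorted)))
    threshold = imp_sorted[k - 1]
    return sum(s > threshold for s in genuine) / len(genuine)

def eer(genuine, impostor):
    """Equal Error Rate: scan thresholds for the point where FAR ~= FRR."""
    best_gap, best_rate = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate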

Quantitative results show systematic underperformance of both models for female faces relative to male and non-binary counterparts, especially for ArcFace (Figure 4).

Figure 4: Comparative verification metrics by gender for DIF-V using ArcFace and Facenet; the male subgroup consistently achieves the highest TAR and lowest EER.

Race subgroup analysis exposes pronounced disparities: models perform worst on East Asian and Southeast Asian faces, while Indian and White faces attain peak accuracy and TAR (Figure 5).

Figure 5: Verification accuracy, TAR, and EER across racial subgroups for ArcFace and Facenet on DIF-V; significant performance penalties are evident for East Asian subgroups.

A critical finding is that applying ID card style transformations to face images further degrades model performance: accuracy and TAR drop substantially post-styling (Figure 6), highlighting the challenge current models face in document-style verification.

Figure 6: Model performance before and after ID style modification; ID card styling leads to marked deterioration in verification accuracy and TAR.

Ethical and Societal Considerations

The dataset design and generation pipeline foreground ethical risks endemic to generative models trained on web-scale data, which may encode and amplify gender, race, and other biases present in society. Furthermore, the realistic quality of AI-generated content exacerbates risks of misinformation and misuse, especially with ID-style images for impersonation or fraud. The provenance and attribution of synthetic content remain persistent challenges.

Implications and Future Directions

The DIF-V dataset provides a synthetic, balanced resource for systematic benchmarking and bias auditing of face verification models in demographic and document-specific contexts. The methodology could be expanded by iterating prompt engineering for emergent traits, integrating new generative or editing models, and automating style adaptation for region-specific ID schemas. Numerically, explicit bias metrics (e.g. subgroup calibration errors) could be incorporated alongside verification scores for more granular analysis.
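One simple explicit bias metric of the kind suggested above is the maximum accuracy gap across subgroups. The sketch below is a hypothetical illustration with made-up outcomes, not a metric or data from the paper:

```python
from collections import defaultdict

def subgroup_accuracy_gap(results):
    """Max minus min accuracy over subgroups.

    results: iterable of (subgroup_label, is_correct) pairs.
    Returns the gap and the per-subgroup accuracies.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, ok in results:
        total[group] += 1
        correct[group] += bool(ok)
    accs = {g: correct[g] / total[g] for g in total}
    return max(accs.values()) - min(accs.values()), accs

# Hypothetical per-pair verification outcomes for two subgroups.
results = [("East Asian", True), ("East Asian", False),
           ("White", True), ("White", True)]
gap, accs = subgroup_accuracy_gap(results)
```

The same disaggregation pattern applies to TAR, EER, or calibration error, computed per gender or race slice of DIF-V.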

From a model development perspective, these findings underscore the vulnerability of current state-of-the-art verification models to demographic and style transfer shifts, motivating research into debiasing algorithms, prompt-robust generative models, and fine-tuning schemes tailored for equitable verification performance across diverse populations and scenarios. The pipeline establishes a reproducible framework for developing and auditing future synthetic datasets targeting biometric tasks.

Conclusion

This work offers a comprehensive approach for building large-scale, demographically balanced synthetic face verification datasets, employing advanced diffusion-based generation, attribute-driven prompt engineering, and identity-variant synthesis combined with ID card style adaptation. Rigorous benchmarking on DIF-V exposes persistent verification biases by gender and race and reveals susceptibility to ID style modifications. These results sharpen the need for improved fairness in biometric systems and provide a blueprint for future dataset and model development aimed at robust, inclusive face verification.
