- The paper introduces DoFaiR, a benchmark dataset evaluating demographic factuality in text-to-image models under diversity prompts.
- It quantifies this factuality tax with dedicated metrics, revealing an average 11.03% drop in factual accuracy and a 181.66% rise in diversity divergence from historical demographic data.
- The study proposes fact-augmented interventions that integrate historical knowledge, significantly improving factual precision without sacrificing diversity.
The Factuality Tax of Diversity-Intervened Text-to-Image Generation
Introduction
The paper "The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention" investigates the impact of diversity-oriented prompts on the demographic factuality in Text-to-Image (T2I) models. The authors introduce a benchmarking dataset named DemOgraphic FActualIty Representation (DoFaiR), which evaluates the extent to which these diversity prompts alter historical demographics. They also propose a novel approach named Fact-Augmented Intervention (FAI) to mitigate the factuality tax while preserving diversity.
Background
Text-to-Image models, such as DALL-E 3 and Stable Diffusion, have been found to generate images laden with social biases, manifesting stereotypes such as depicting doctors as male and nurses as female. To combat these biases, existing methods apply diversity interventions in the form of prompt modifications that instruct the model to produce demographically diverse outputs. However, these interventions can distort historical demographic distributions, leading to nonfactual and potentially offensive outputs when generating images of historical figures or events.
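As a purely illustrative sketch (the exact wording of diversity instructions varies across systems and is not reproduced from the paper), such an intervention typically appends a generic diversity instruction to the user's prompt:

```python
# Illustrative only: a typical prompt-level diversity intervention.
# The instruction text below is a hypothetical example, not the paper's wording.
def apply_diversity_intervention(prompt: str) -> str:
    """Append a generic diversity instruction to a T2I prompt."""
    instruction = "Depict people of diverse genders and racial backgrounds."
    return f"{prompt} {instruction}"

print(apply_diversity_intervention("A photo of the signers of a historic treaty"))
```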
The DoFaiR Benchmark
Dataset and Metrics:
The DoFaiR benchmark consists of 756 meticulously curated test cases that reflect demographic distributions grounded in historical records. The dataset includes both gender-related and race-related instances, verified through an automated pipeline and human validation. DoFaiR provides a set of metrics for evaluating demographic factuality (a minimal computational sketch follows the list):
- Dominant Demographic Accuracy (DDA)
- Involved Demographic Accuracy (IDA)
- Involved Demographic F-1 Score (IDF)
- Factual Diversity Divergence (FDD)
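The paper defines these metrics formally; the following is only a rough sketch, under the assumption that DDA checks whether the most frequent generated group matches the factually dominant group, and that FDD measures the distance between the generated and factual demographic distributions (a total-variation-style distance is used here as a stand-in for the paper's exact divergence):

```python
from collections import Counter

def dominant_demographic_accuracy(pred_labels, factual_dominant):
    """Sketch of DDA for one test case: does the most frequent predicted
    group among generated faces match the factually dominant group?"""
    dominant_pred = Counter(pred_labels).most_common(1)[0][0]
    return float(dominant_pred == factual_dominant)

def factual_diversity_divergence(pred_labels, factual_dist):
    """Sketch of FDD as a total-variation-style distance between the
    generated and factual demographic distributions (the paper's exact
    divergence measure may differ)."""
    counts = Counter(pred_labels)
    total = sum(counts.values())  # assumes at least one face was detected
    groups = set(factual_dist) | set(counts)
    return 0.5 * sum(
        abs(counts.get(g, 0) / total - factual_dist.get(g, 0.0))
        for g in groups
    )

# Hypothetical labels for faces detected in one generated image:
preds = ["white", "white", "black", "asian"]
print(dominant_demographic_accuracy(preds, "white"))                      # 1.0
print(factual_diversity_divergence(preds, {"white": 0.9, "black": 0.1}))  # 0.4
```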
Pipeline:
The DoFaiR evaluation pipeline involves generating images based on prompts that specify historical events and participant classes. Faces in these images are detected and analyzed using the FairFace demographic classifier. The generated demographic distributions are then compared against ground truth data.
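A minimal sketch of one evaluation step under these assumptions, reusing the metric sketches above; `generate_image` and `classify_faces` are hypothetical stand-ins for a T2I API and a FairFace-style demographic classifier, not functions from the authors' code:

```python
def evaluate_case(prompt, factual_dominant, factual_dist,
                  generate_image, classify_faces):
    """Sketch of one DoFaiR-style evaluation step: generate an image from the
    event prompt, classify detected faces, and score against ground truth.
    `generate_image` and `classify_faces` are hypothetical stand-ins for a
    T2I API and a FairFace-style demographic classifier."""
    image = generate_image(prompt)
    face_labels = classify_faces(image)   # e.g. ["white", "black", ...]
    return {
        "DDA": dominant_demographic_accuracy(face_labels, factual_dominant),
        "FDD": factual_diversity_divergence(face_labels, factual_dist),
    }
```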
Experimental Observations
Trade-off Analysis:
Experiments on DoFaiR reveal a significant factuality tax associated with diversity interventions:
- Diversity instructions increase demographic diversity in generated images, but in ways that diverge notably from historical facts.
- On average across the tested models, the evaluation showed a 181.66% increase in diversity divergence and an 11.03% decrease in factuality accuracy (relative changes; see the sketch below).
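These figures are relative (percentage) changes between the no-intervention and diversity-intervened settings; with purely hypothetical metric values, the computation would look like this:

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change of a metric after applying diversity prompts."""
    return (after - before) / before * 100

# Purely hypothetical values, chosen only to illustrate the arithmetic:
print(relative_change(before=0.30, after=0.845))   # ~ +181.7 (divergence rises)
print(relative_change(before=0.80, after=0.712))   # ~ -11.0 (accuracy drops)
```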
Model Performance:
The experiments show that T2I models, especially under diversity-oriented prompts, depict racial compositions less accurately than gender compositions. The models also find it harder to correctly capture the full set of involved demographic groups than to match the dominant demographic.
Fact-Augmented Intervention (FAI)
To address the trade-off between diversity and factuality, the authors propose the FAI method, which augments T2I generation prompts with historical demographic knowledge (a prompt-construction sketch follows the list below):
- FAI-VK (Verbalized Knowledge): Incorporates verbalized factual knowledge elicited from a strong LLM.
- FAI-RK (Retrieved Knowledge): Uses a pipeline that retrieves relevant historical documents from Wikipedia and integrates this knowledge into the generation prompts.
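A rough sketch of the idea (not the authors' implementation): both variants inject factual demographic context into the generation prompt, with the knowledge-elicitation (FAI-VK) and retrieval (FAI-RK) steps stubbed out by a hypothetical string:

```python
def fact_augmented_prompt(event_prompt: str, knowledge: str,
                          diversity_instruction: str) -> str:
    """Sketch of fact-augmented intervention: prepend factual demographic
    context so that diversification stays consistent with history.
    The paper's actual prompt template may differ."""
    return (
        f"Historical context: {knowledge}\n"
        f"{event_prompt} {diversity_instruction} "
        "Only diversify demographics in ways consistent with the context above."
    )

# In FAI-VK, `knowledge` would be verbalized by a strong LLM;
# in FAI-RK, it would be drawn from retrieved Wikipedia passages.
# Here it is a hypothetical placeholder string:
knowledge = "Participants in this event were predominantly of group X."
print(fact_augmented_prompt(
    "A photo of the participants of <historical event>",
    knowledge,
    "Depict people of diverse genders and racial backgrounds.",
))
```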
Results:
- Both FAI methods effectively mitigate the factuality tax of diversity interventions, significantly improving demographic factuality.
- FAI-RK demonstrates strong performance in preserving factual demographic diversity, reducing nonfactual diversity more effectively than the baseline methods.
- Additionally, FAI methods outperform simple Chain-of-Thought (CoT) reasoning, which fails to maintain factual correctness under diversity prompts.
Practical and Theoretical Implications
From a practical standpoint, the proposed FAI methods enable the use of diversity interventions without compromising historical accuracy. This has implications for applications requiring high-fidelity representations of historical events and figures, such as educational materials and media production. Theoretically, the work encourages a reevaluation of how diversity interventions are designed and applied in generative models, advocating for factually informed approaches that maintain credibility and trustworthiness.
Conclusion and Future Directions
The study elucidates the challenges and trade-offs inherent in using diversity prompts in T2I models. By introducing a comprehensive benchmark (DoFaiR) and proposing effective mitigation strategies (FAI), the authors lay the groundwork for more controlled and factually grounded generative processes. Future research could extend these findings to other demographic dimensions and explore adaptive intervention strategies that dynamically balance diversity and factuality based on contextual needs.