The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention (2407.00377v2)

Published 29 Jun 2024 in cs.CL, cs.AI, cs.CV, and cs.CY

Abstract: Prompt-based "diversity interventions" are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3's generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs an LLM to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.

Summary

The paper introduces DoFaiR, a benchmark of 756 fact-checked test instances for evaluating demographic factuality in text-to-image models under diversity prompts.
It quantifies the factuality tax with four metrics, reporting an average 11.03% drop in factuality accuracy and a 181.66% rise in diversity divergence from historical ground truth.
The study proposes Fact-Augmented Intervention (FAI), which adds verbalized or retrieved historical knowledge to the generation context and significantly improves demographic factuality while preserving diversity.

Introduction

The paper "The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention" investigates how diversity-oriented prompts affect demographic factuality in Text-to-Image (T2I) models. The authors introduce a benchmark named DemOgraphic FActualIty Representation (DoFaiR), which measures the extent to which these diversity prompts distort the depicted demographics of historical subjects. They also propose an approach named Fact-Augmented Intervention (FAI) to mitigate this factuality tax while preserving diversity.

Background

Text-to-Image models, such as DALLE-3 and Stable Diffusion, have been found to generate images laden with social biases, manifesting stereotypes like depicting doctors as male and nurses as female. To combat these biases, existing methods use diversity interventions in the form of prompt modifications that instruct the models to produce demographically diverse outputs. However, these interventions can distort historical demographic distributions, leading to nonfactual and potentially offensive outputs when generating images depicting historical figures or events.

The DoFaiR Benchmark

Dataset and Metrics:

The DoFaiR benchmark consists of 756 meticulously curated test cases that reflect accurate demographic distributions extracted from historical contexts. The dataset includes both gender-related and race-related instances, verified through automated pipelines and human validation. DoFaiR provides a set of metrics for evaluating demographic factuality:

Dominant Demographic Accuracy (DDA)
Involved Demographic Accuracy (IDA)
Involved Demographic F-1 Score (IDF)
Factual Diversity Divergence (FDD)
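
The summary names these metrics without defining them, so the following is only a minimal sketch of how set-based scores of this kind could be computed for a single test instance. The function names, the group labels, and in particular the formulas for involved-group accuracy and diversity divergence are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Iterable, Set


def dominant_correct(pred_dominant: str, fact_dominant: str) -> bool:
    """Per-instance ingredient of a DDA-style score: dominant group matches."""
    return pred_dominant == fact_dominant


def involved_accuracy(pred: Set[str], fact: Set[str], groups: Iterable[str]) -> float:
    """ASSUMED definition: share of groups whose presence/absence is correct."""
    groups = list(groups)
    correct = sum((g in pred) == (g in fact) for g in groups)
    return correct / len(groups)


def involved_f1(pred: Set[str], fact: Set[str]) -> float:
    """Standard F1 between predicted and factual sets of involved groups."""
    if not pred and not fact:
        return 1.0
    true_pos = len(pred & fact)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(fact)
    return 2 * precision * recall / (precision + recall)


def diversity_divergence(pred: Set[str], fact: Set[str], groups: Iterable[str]) -> float:
    """ASSUMED definition: gap in the fraction of groups that are represented."""
    return abs(len(pred) - len(fact)) / len(list(groups))


# Hypothetical instance: one factual group, three groups in the generated image.
ALL_GROUPS = ["group_a", "group_b", "group_c", "group_d"]
predicted = {"group_a", "group_b", "group_c"}
factual = {"group_a"}

print(dominant_correct("group_a", "group_a"))
print(involved_accuracy(predicted, factual, ALL_GROUPS))
print(involved_f1(predicted, factual))
print(diversity_divergence(predicted, factual, ALL_GROUPS))
```

In this toy instance the over-diversified output still gets the dominant group right but loses precision on the involved groups, which is the kind of gap the benchmark is designed to expose.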

Pipeline:

The DoFaiR evaluation pipeline involves generating images based on prompts that specify historical events and participant classes. Faces in these images are detected and analyzed using the FairFace demographic classifier. The generated demographic distributions are then compared against ground truth data.
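
The aggregation step of such a pipeline can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes face detection and demographic classification have already produced one label per detected face, and the function name `summarize_faces`, the `min_share` threshold, and the placeholder labels are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def summarize_faces(
    labels: List[str], min_share: float = 0.0
) -> Tuple[str, Set[str], Dict[str, float]]:
    """Turn per-face labels into (dominant group, involved groups, shares).

    `labels` holds one predicted group per detected face. A group counts as
    involved when its share of faces is strictly greater than `min_share`.
    """
    if not labels:
        raise ValueError("no faces were detected in the generated images")
    counts = Counter(labels)
    total = len(labels)
    shares = {group: n / total for group, n in counts.items()}
    # On ties, max() keeps the group that appeared first in `labels`.
    dominant = max(shares, key=shares.get)
    involved = {group for group, share in shares.items() if share > min_share}
    return dominant, involved, shares


# Hypothetical classifier output for three detected faces.
dominant, involved, shares = summarize_faces(["group_a", "group_a", "group_b"])
print(dominant, sorted(involved), shares)
```

The dominant and involved groups returned here are what would be compared against the fact-checked ground truth for the same prompt.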

Experimental Observations

Trade-off Analysis:

Experiments on DoFaiR reveal a significant factuality tax associated with diversity interventions:

Diversity instructions increase the demographic diversity of generated images, but the resulting distributions diverge from historical facts.
On average across the tested models, diversity divergence increased by 181.66% and factuality accuracy decreased by 11.03%.

Model Performance:

The experiments show that T2I models, especially under diversity-oriented prompts, depict racial composition less accurately than gender composition. Models also identify the full set of involved demographic groups less accurately than the single dominant group.

Fact-Augmented Intervention (FAI)

To address the trade-off between diversity and factuality, the authors propose the FAI method, which augments T2I model prompts with historical information:

FAI-VK (Verbalized Knowledge): Incorporates verbalized factual knowledge elicited from a strong LLM.
FAI-RK (Retrieved Knowledge): Uses a pipeline that retrieves relevant historical documents from Wikipedia and integrates this knowledge into the generation prompts.
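
To make the idea of augmenting the generation context concrete, here is a minimal sketch of the final prompt-assembly step. It is an assumption-laden illustration: the function `build_fai_prompt` and the template wording are hypothetical, and the factual knowledge string is supplied by the caller (from an LLM for FAI-VK, or from a retrieved document for FAI-RK) rather than produced here. The LLM reflection step described in the paper is not modeled.

```python
from typing import Optional


def build_fai_prompt(
    image_prompt: str,
    diversity_instruction: str,
    factual_knowledge: Optional[str] = None,
) -> str:
    """Compose the text passed to a T2I model (hypothetical template).

    Without factual knowledge this reduces to a plain diversity intervention.
    With it, the diversity instruction is conditioned on the stated facts.
    """
    parts = [image_prompt.strip()]
    knowledge = (factual_knowledge or "").strip()
    if knowledge:
        parts.append("Historical facts about the people depicted: " + knowledge)
        parts.append(
            diversity_instruction.strip()
            + " Only do so where it is consistent with the historical facts above."
        )
    else:
        parts.append(diversity_instruction.strip())
    return "\n".join(parts)


# Placeholder inputs; real knowledge text would come from an LLM or retrieval.
plain = build_fai_prompt(
    "Generate an image of <historical event participants>.",
    "Depict people of diverse genders and races.",
)
augmented = build_fai_prompt(
    "Generate an image of <historical event participants>.",
    "Depict people of diverse genders and races.",
    "<knowledge text about the factual gender and racial composition>",
)
print(plain)
print("---")
print(augmented)
```

Keeping the knowledge source outside the function means the same assembly step serves both variants; only the origin of `factual_knowledge` differs.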

Results:

Both FAI methods effectively mitigate the factuality tax of diversity interventions, significantly improving demographic factuality.
FAI-RK demonstrates strong performance in preserving factual demographic diversity, reducing nonfactual diversity more effectively than the baseline methods.
Additionally, FAI methods outperform simple Chain-of-Thought (CoT) reasoning, which fails to maintain factual correctness under diversity prompts.

Practical and Theoretical Implications

From a practical standpoint, the proposed FAI methods enable the use of diversity interventions without compromising historical accuracy. This has implications for applications requiring high-fidelity representations of historical events and figures, such as educational materials and media production. Theoretically, the work encourages a reevaluation of how diversity interventions are designed and applied in generative models, advocating for factually informed approaches to maintain credibility and trustworthiness.

Conclusion and Future Directions

The study elucidates the challenges and trade-offs inherent in using diversity prompts in T2I models. By introducing a comprehensive benchmark (DoFaiR) and proposing effective mitigation strategies (FAI), the authors lay the groundwork for more controlled and factually grounded generative processes. Future research could extend these findings to other demographic dimensions and explore adaptive intervention strategies that dynamically balance diversity and factuality based on contextual needs.