Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance (2406.04551v2)

Published 6 Jun 2024 in cs.CV, cs.AI, and cs.LG

Abstract: With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a "memory bank" of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.

Authors (6)

Reyhane Askari Hemmat
Melissa Hall
Alicia Sun
Candace Ross
Michal Drozdzal
Adriana Romero-Soriano

Summary

The paper presents c-VSG, an inference-time intervention that adds Vendi Score guidance to latent diffusion models and improves worst-region F1 by up to 40% relative to the baseline.
It is evaluated on two geographically representative datasets, GeoDE and DollarStreet, where it increases image variety and reduces stereotypical portrayals of regions.
The approach combines a memory bank of previously generated images, which pushes new samples toward diversity, with a set of real exemplar images, which keeps that variation within a realistic range.

Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Text-to-image generative models often fail to reflect the geographic and cultural diversity of the real world. The paper "Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance" addresses this issue with an inference-time intervention named contextualized Vendi Score Guidance (c-VSG). The method aims to increase geographic diversity in generated images without sacrificing image quality or text-image consistency.

Core Contributions

The paper contributes primarily in the following areas:

Diversity Enhancement with Contextualized VSG:
- The central innovation is to use the Vendi Score (VS), originally an evaluation metric for dataset diversity, as a guidance signal in latent diffusion models (LDMs). Specifically, c-VSG modifies the score function of the LDM so that each new sample is pushed away from a memory bank of previously generated images, which increases diversity, while a set of real-world exemplar images constrains that variation so the outputs stay realistic.
Evaluation using Geographically Diverse Datasets:
- The effectiveness of c-VSG is evaluated on two geographically representative datasets, GeoDE and DollarStreet. The results show improved image diversity, with worst-region F1 increasing by up to 40% relative to the baseline LDM.
Reduction in Geographical Representation Bias:
- Qualitative analyses indicate that c-VSG reduces the reductive and stereotypical representations of geographic regions that appear in images from the original model.
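
For reference, the Vendi Score mentioned in the first contribution can be written out as follows. This is the standard definition of the metric rather than anything specific to c-VSG; the symbols $n$, $k$, $K$, and $\lambda_i$ are introduced here for illustration and do not appear in the summary above. Given $n$ samples $x_1, \dots, x_n$ and a positive semidefinite similarity function $k$ with $k(x, x) = 1$:

$$
K_{ij} = k(x_i, x_j), \qquad
\mathrm{VS}(x_1, \dots, x_n) = \exp\!\left( -\sum_{i=1}^{n} \lambda_i \log \lambda_i \right),
$$

where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $K / n$ and $0 \log 0$ is taken to be $0$. The score equals $1$ when all samples are identical under $k$ and reaches $n$ when they are all mutually dissimilar, so it can be read as an effective number of distinct samples.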

Methodology

The proposed c-VSG framework modifies the score function during the denoising steps of the LDM. The Vendi Score is used in two ways:

Memory Bank for Diversity:
- The Vendi Score encourages the generation of images that are different from the ones stored in the memory bank, thereby enhancing diversity.
Exemplar Images for Contextualization:
- A set of randomly chosen real-world images (exemplars) guides the generation process, keeping the generated images realistic and anchored to the contextual features of the real world.

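The generation process balances these two influences with scaling parameters ($\alpha$ for the memory bank and $\beta$ for contextualization). Combining them is meant to ease the trade-off between diversity and quality: the memory bank term alone could push samples toward unrealistic outputs, and the exemplar term limits how far they can drift.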
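
The interplay of the two terms can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it scores precomputed feature vectors instead of differentiating through the denoiser, it assumes a cosine-similarity kernel, and it assumes the exemplar term enters with a negative sign so that a sample which would add variation beyond the exemplar set is penalized. The function names `vendi_score` and `cvsg_objective` are hypothetical.

```python
import numpy as np


def vendi_score(features: np.ndarray) -> float:
    """Vendi Score of a set of nonzero feature vectors (one per row).

    Uses a cosine-similarity kernel (an assumption made for this sketch).
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    kernel = unit @ unit.T
    eigvals = np.linalg.eigvalsh(kernel / len(unit))
    eigvals = eigvals[eigvals > 1e-12]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))


def cvsg_objective(candidate, memory_bank, exemplars, alpha, beta):
    """Illustrative guidance objective for one candidate sample.

    Rewards diversity relative to the memory bank (weight alpha) and
    penalizes variation added on top of the exemplar set (weight beta).
    The sign convention is an assumption of this sketch.
    """
    with_memory = np.vstack([memory_bank, candidate])
    with_exemplars = np.vstack([exemplars, candidate])
    return alpha * vendi_score(with_memory) - beta * vendi_score(with_exemplars)


# Toy features: three orthogonal directions stand in for three "looks".
e1, e2, e3 = np.eye(3)
memory_bank = np.stack([e1, e1])  # the model keeps generating look 1
exemplars = np.stack([e1, e2])    # real images show looks 1 and 2

candidates = {
    "duplicate": e1,          # repeats what is already in the memory bank
    "novel_in_context": e2,   # new, and covered by the exemplars
    "novel_off_context": e3,  # new, but outside the exemplar variation
}
scores = {
    name: cvsg_objective(vec, memory_bank, exemplars, alpha=1.0, beta=1.0)
    for name, vec in candidates.items()
}
best = max(scores, key=scores.get)
print(best)
```

In this toy setup the candidate that is new relative to the memory bank but still covered by the exemplars scores highest, which is the behavior the two-term design is meant to produce. In the actual method the guidance acts on the latent during denoising, so a faithful implementation would need a differentiable feature extractor and gradients of these terms rather than a comparison of fixed candidates.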

Results and Analysis

Empirical Evaluation:
- c-VSG outperforms the baseline LDM in F1 score on both GeoDE and DollarStreet. The headline result is a 40% relative improvement in worst-region F1 on GeoDE, achieved while maintaining or improving image quality and text-image consistency.
Qualitative Improvements:
- Visual inspections reveal that c-VSG delivers greater variation in object types, colors, and compositions compared to the baseline. The technique reduces the prominence of stereotypical backgrounds and aligns more closely with the real-world distribution of images.
Ablation Studies:
- Systematic ablations test the contribution of each component of c-VSG. Variants that use only the exemplar images or only the memory bank underperform the combination, which indicates that both terms are needed for the best results.
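
To make the "worst-region" and "relative improvement" wording concrete, the arithmetic can be sketched as below. The per-region F1 values and region names are made up purely for illustration and are not results from the paper; the helper names `worst_region` and `relative_improvement` are likewise hypothetical.

```python
def worst_region(f1_by_region: dict[str, float]) -> float:
    """Return the lowest F1 across regions."""
    return min(f1_by_region.values())


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Hypothetical per-region F1 scores, NOT taken from the paper.
baseline_f1 = {"region_a": 0.30, "region_b": 0.45, "region_c": 0.50}
guided_f1 = {"region_a": 0.42, "region_b": 0.50, "region_c": 0.52}

# The worst region is found separately for each model, so it does not
# have to be the same region before and after guidance.
gain = relative_improvement(worst_region(baseline_f1), worst_region(guided_f1))
print(f"worst-region relative gain: {gain:.0%}")
```

A relative gain of this kind says nothing about the average over regions, which is why the paper's abstract reports both worst-region and average performance.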

Implications and Future Directions

The implications of this research are twofold:

Practical Enhancements:
- c-VSG presents a feasible strategy for improving the geographical diversity of images in applications like content creation, digital media, and educational tools. This can enrich the user experience by providing culturally and contextually relevant imagery.
Theoretical Advancement:
- The method shows how a diversity metric such as the Vendi Score can be operationalized within the guidance mechanism of a generative model. This invites further exploration of other diversity metrics and of how they might be used as guidance signals.

Looking ahead, future developments could involve:

Adjusting the balance parameters ($\alpha$ and $\beta$) dynamically during the generation process.
Exploring the scalability of c-VSG to larger and more complex datasets.
Integrating human evaluation studies to gauge subjective perceptions of diversity and realism.

In conclusion, this paper offers an inference-time approach to mitigating geographic representation biases in text-to-image generation. By combining Vendi Score guidance for diversity with contextualization from real exemplar images, c-VSG increases the diversity of generated images while maintaining or improving their quality and consistency, a meaningful step towards more inclusive and representative AI-generated content.
