Beyond Aesthetics: Cultural Competence in Text-to-Image Models
Text-to-image (T2I) models are advancing rapidly and are reshaping creative domains such as digital art, advertising, and education. However, evaluating these models only on photorealism, faithfulness, and aesthetics leaves their cultural competence unmeasured. This paper addresses that gap by introducing CUBE (CUltural BEnchmark for Text-to-Image models), a benchmark that evaluates T2I models along two dimensions: cultural awareness and cultural diversity.
Contributions and Methodology
The main contribution is CUBE, which has two components: CUBE-1K and CUBE-CSpace. CUBE-1K is a gold-standard set of 1,000 prompts designed to evaluate the cultural awareness of T2I models. CUBE-CSpace is a larger resource of approximately 300K cultural artifacts, used as grounding for analyzing cultural diversity.
Cultural Awareness: Cultural awareness is evaluated with CUBE-1K across three concepts: cuisine, landmarks, and art. Candidate cultural artifacts are first extracted from structured knowledge bases (KBs) such as Wikidata. LLMs such as GPT-4-Turbo are then used to filter the extracted set and fill in missing artifacts, so that the collection is both diverse and relevant. Human annotators from various cultures rated the generated images on cultural relevance, faithfulness, and realism. Their ratings reveal notable gaps in the ability of existing T2I models to represent diverse cultural artifacts accurately and realistically.
Cultural Diversity: The paper introduces cultural diversity (CD) as a novel metric for evaluating T2I models. This metric leverages the quality-weighted Vendi score, which balances the diversity of artifacts with their generation quality. Various similarity kernels are defined to capture different facets of geo-cultural diversity, including continent-level, country-level, and artifact-level similarities.
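To make the metric concrete, the following is a minimal sketch of a quality-weighted Vendi score over a label-based similarity kernel. It assumes the standard Vendi score definition (the exponential of the Shannon entropy of the eigenvalues of K/n, for an n-by-n similarity matrix K with ones on the diagonal) and takes the quality-weighted variant to be the mean quality multiplied by the Vendi score. The function names, the binary same-label kernel, and the example labels and quality values are illustrative assumptions, not the paper's implementation; the kernels used in CUBE may be defined differently.

```python
import numpy as np


def label_kernel(labels):
    """Binary similarity: 1.0 if two items share a label (e.g. country), else 0.0.

    Hypothetical stand-in for a continent-, country-, or artifact-level kernel.
    """
    arr = np.asarray(labels)
    return (arr[:, None] == arr[None, :]).astype(float)


def vendi_score(K):
    """Exponential of the Shannon entropy of the eigenvalues of K / n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop numerically zero eigenvalues so that 0 * log(0) is treated as 0.
    eigvals = eigvals[eigvals > 1e-12]
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))


def quality_weighted_vendi(K, quality):
    """Mean per-item quality times the Vendi score (assumed definition)."""
    return float(np.mean(quality)) * vendi_score(K)


# Illustrative inputs only: four generated images, tagged by country of origin,
# with made-up quality scores in [0, 1].
countries = ["Japan", "Japan", "Brazil", "Nigeria"]
quality = [0.9, 0.8, 0.6, 0.7]

K_country = label_kernel(countries)
diversity = vendi_score(K_country)
weighted = quality_weighted_vendi(K_country, quality)
```

With this binary kernel, the Vendi score equals 1 when every item shares the same label and equals the number of items when all labels are distinct, so it can be read as an effective number of distinct cultures in the sample. Multiplying by mean quality lowers the score for outputs that are diverse but poorly generated.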
Numerical Results and Implications
The human evaluation shows significant disparities in cultural competence across countries and cultural concepts. Both Imagen 2 and Stable Diffusion XL leave extensive room for improvement, especially for artifacts from the Global South. The models were frequently biased towards well-represented, popular countries, and received lower cultural awareness and diversity scores for countries such as Brazil, Turkey, and Nigeria.
Cultural diversity was measured over a wide range of prompts and seeds. Even the best-performing models obtained relatively low diversity scores, indicating limited geo-cultural coverage in their outputs. This underscores the need to prioritize cultural diversity explicitly when developing and training T2I models.
Future Directions
These findings bear directly on how T2I models are developed. Cultural competence needs to be treated as a core objective in both training and evaluation. This means not only expanding datasets to include more culturally diverse inputs, but also developing metrics that capture the nuances of cultural representation.
Practical Implications: For practitioners, the CUBE benchmark provides a tool to evaluate and improve the cultural competence of T2I models. Integrating such evaluations into the model development lifecycle can help models serve a global audience and reduce the risk of cultural misrepresentation and bias.
Theoretical Implications: From a theoretical standpoint, the introduction of novel metrics to evaluate cultural diversity contributes to the broader understanding of what constitutes fairness and inclusivity in AI models. Future research can build on this work by exploring more granular cultural definitions and incorporating additional dimensions such as sub-cultures and co-cultures.
Conclusion
This paper identifies and begins to address the gaps in the cultural competence of T2I models. The CUBE benchmark, together with cultural awareness and cultural diversity as evaluation dimensions, is an important step towards more inclusive and globally representative AI technologies. Future work will likely integrate these benchmarks and metrics into standard model evaluation frameworks, so that AI systems better serve the full diversity of human cultures.