Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models
The paper examines the cultural perception embedded in Text-to-Image (TTI) models such as DALL-E and StableDiffusion. It investigates the cultural agency of these models and proposes a framework to unlock and evaluate the cultural knowledge in the images they generate. The authors establish an ontology of cultural dimensions, domains, and concepts, from which they derive template prompts designed to reveal the cultural inclinations latent in TTI models. They also propose an assessment suite for evaluating the cultural content encoded in the generated images.
Methodology and Dataset
The research rests on three pillars: the development of a cultural ontology, experiments with several TTI models that incorporate multilingual text encoders, and the formulation of evaluation metrics. The metrics comprise intrinsic evaluations in the CLIP embedding space, extrinsic evaluations using a Visual Question Answering (VQA) model, and human assessments.
The CulText2I dataset is central to the paper. It consists of images generated from prompts based on cultural concepts in ten languages, using four TTI models: StableDiffusion, AltDiffusion, DeepFloyd, and DALL-E. Because the same set of prompt templates is applied consistently across models and languages, the dataset supports a like-for-like assessment of the models' cross-cultural capabilities and of the cultural fidelity of the resulting images.
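To make the template-driven setup concrete, the following is a minimal sketch of how prompts could be expanded from a small ontology. The ontology entries, the template wording, the language codes, and the function name `build_prompts` are all illustrative assumptions, not the paper's actual ontology or templates. Only the variant that embeds a nationality identifier in an English prompt is shown; a fully translated variant would need a translation step that is omitted here.

```python
from itertools import product

# Hypothetical mini-ontology (domain -> concepts); NOT the paper's real ontology.
ONTOLOGY = {
    "food": ["breakfast", "dinner"],
    "celebration": ["wedding"],
}

# Hypothetical mapping from language code to a nationality adjective.
NATIONALITY = {"de": "German", "zh": "Chinese"}

# Assumed template wording; article handling ("a"/"an") is deliberately simplified.
NATIONAL_TEMPLATE = "a photo of a {nationality} {concept}"


def build_prompts(ontology, nationalities):
    """Expand every (concept, nationality) pair into one prompt record."""
    records = []
    for domain, concepts in ontology.items():
        for concept, (lang, nationality) in product(concepts, nationalities.items()):
            records.append(
                {
                    "domain": domain,
                    "concept": concept,
                    "lang": lang,
                    "prompt": NATIONAL_TEMPLATE.format(
                        nationality=nationality, concept=concept
                    ),
                }
            )
    return records


if __name__ == "__main__":
    for record in build_prompts(ONTOLOGY, NATIONALITY):
        print(record["lang"], "|", record["prompt"])
```

Keeping the domain, concept, and language alongside each prompt makes it straightforward to group generated images later when comparing models or languages.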
Key Findings
The authors' inquiry into the cultural encoding within TTI models yielded noteworthy insights:
- Cultural Knowledge Encoding: The TTI models encode cultural knowledge to differing degrees, and the differences depend on the language of the prompt. The intrinsic National Association (NA) scores vary both across models and across input languages, suggesting uneven cultural sensitivity.
- Unlocking Cultural Knowledge: Language-specific prompt templates revealed latent cultural knowledge in the TTI models. Notably, for text encoders that were not deliberately trained to be multilingual, fully translated prompts and prompts embedding cultural identifiers (e.g., nationality) were equally effective at unlocking cultural knowledge.
- Cultural Dimensions Projections: The TTI models encapsulate trends in cultural dimensions, aligning with cultural dimensions proposed in classical cultural studies. Visual outputs displayed distinct cultural interpretations invoking Big Five personality traits, Hofstede's dimensions, and Inglehart-Welzel cultural axes.
- Cross-Cultural Similarities and Distinctiveness: The inherent model biases affect perception and representation of cultures cross-linguistically. For example, European languages displayed higher inherent similarities, which is indicative of a shared cultural lexicon influenced by etymology.
- Influence of Alphabet Characters: Even a single linguistic character from a target language, when integrated into a prompt, noticeably influenced the resultant image's cultural tone. This highlights the models' sensitivity to linguistic nuances within prompts.
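As an illustration of what an intrinsic, embedding-space association score might look like, the sketch below compares one image embedding against text embeddings of several nation names and reports the softmax probability assigned to the target nation. This is an assumed formulation for illustration only and is not necessarily how the paper defines its NA score. The function names, the toy two-dimensional vectors, and the temperature of 0.01 (chosen to resemble the roughly 100x logit scaling commonly used with CLIP similarities) are all assumptions; in practice the embeddings would come from a CLIP-style image and text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def association_distribution(image_emb, nation_text_embs, temperature=0.01):
    """Map each candidate nation to the softmax of its similarity to the image."""
    names = list(nation_text_embs)
    sims = [cosine(image_emb, nation_text_embs[name]) for name in names]
    return dict(zip(names, softmax(sims, temperature)))


def association_score(image_emb, nation_text_embs, target, temperature=0.01):
    """Probability mass the image places on the target nation (assumed metric)."""
    return association_distribution(image_emb, nation_text_embs, temperature)[target]


if __name__ == "__main__":
    # Toy embeddings standing in for real encoder outputs.
    image = [1.0, 0.1]
    nations = {"German": [1.0, 0.0], "Chinese": [0.0, 1.0]}
    print(association_score(image, nations, target="German"))
```

Averaging such a score over all images generated for a given language and model would give one number per (model, language) pair, which is the granularity at which the findings above are reported.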
Implications and Future Directions
The outcomes of this research have significant implications for the application and development of TTI models in multicultural contexts. Practically, this paper informs the design of TTI models that are capable of more accurately rendering culturally pertinent images. Theoretically, it also provides a framework for further scrutinizing how these models can be optimized for cultural accuracy and inclusivity. One crucial insight is that explicit cultural parameters embedded in prompts serve as advantageous levers for surfacing and interrogating the cultural representations in model outputs.
In future work, researchers could explore more nuanced cultural concepts or dimensions and analyze a wider range of TTI models, especially given how quickly multilingual training data is changing. Expanding the dataset and integrating more sophisticated human and automatic evaluations could also provide a more detailed understanding of cultural perception in AI models. Finally, understanding the internal architectures of TTI models, and how cross-linguistic prompts influence their outputs, could open new avenues for improving cultural alignment in image generation.