
Social Biases through the Text-to-Image Generation Lens (2304.06034v1)

Published 30 Mar 2023 in cs.CY, cs.AI, cs.CL, and cs.CV

Abstract: Text-to-Image (T2I) generation is enabling new applications that support creators, designers, and general end users of productivity software by generating illustrative content with high photorealism starting from a given descriptive text as a prompt. Such models, however, are trained on massive amounts of web data, which surfaces the peril of potential harmful biases that may leak into the generation process itself. In this paper, we take a multi-dimensional approach to studying and quantifying common social biases as reflected in the generated images, focusing on how occupations, personality traits, and everyday situations are depicted across representations of (perceived) gender, age, race, and geographical location. Through an extensive set of both automated and human evaluation experiments, we present findings for two popular T2I models: DALLE-v2 and Stable Diffusion. Our results reveal severe occupational biases, with neutral prompts largely excluding groups of people from the results of both models. Such biases can be mitigated by increasing the amount of specification in the prompt itself, although prompt-based mitigation does not address discrepancies in image quality or other usages of the model or its representations in other scenarios. Further, we observe personality traits being associated with only a limited set of people at the intersection of race, gender, and age. Finally, an analysis of geographical representations of everyday situations (e.g., parks, food, weddings) shows that, for most situations, images generated through default location-neutral prompts are closest to images generated for the United States and Germany.

Social Biases through the Text-to-Image Generation Lens

The advancement of Text-to-Image (T2I) generation models, exemplified by DALLE-v2 and Stable Diffusion, offers transformative societal applications ranging from design to entertainment. However, these models, heavily reliant on extensive datasets sourced from the web, may be inadvertently embedding and propagating social biases within their generated outputs. The paper "Social Biases through the Text-to-Image Generation Lens" methodically investigates these biases across multiple dimensions, including gender, race, age, and geographic location, revealing the extent to which they manifest in the portrayal of occupations, personality traits, and everyday situations.

Methodological Overview

The paper employs a multi-faceted approach, utilizing both automated and human evaluation mechanisms to appraise image outputs from neutral and expanded prompts. By contrasting these with real-world demographic statistics from the U.S. Bureau of Labor Statistics (BLS), it provides a baseline for assessing how closely image generation models align with societal representations. Furthermore, by expanding prompts, the paper evaluates the efficacy of detailed inputs in ameliorating biased representations, while also scrutinizing resultant impacts on image quality.
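To make this setup concrete, the sketch below illustrates the two core ingredients in hypothetical form: expanding a neutral occupation prompt with a demographic specifier, and comparing the demographic distribution observed in generated images against a BLS reference distribution. The prompt template, attribute wording, and the example numbers are assumptions for illustration, not the paper's actual code or data.

```python
from collections import Counter

def expand_prompt(occupation: str, gender: str | None = None) -> str:
    """Build a neutral or gender-specified T2I prompt for an occupation.

    The template wording is an assumption; the paper's key idea is that
    adding specification ("female CEO" vs. "CEO") changes representation.
    """
    subject = f"{gender} {occupation}" if gender else occupation
    return f"a portrait photo of a {subject}"

def representation_gap(observed: Counter, reference: dict) -> dict:
    """Per-group difference between the share of generated images labeled
    with each group and the reference (e.g., BLS) share for that group."""
    total = sum(observed.values())
    return {g: observed[g] / total - reference.get(g, 0.0) for g in observed}

# Hypothetical usage: labels would come from human annotators or an
# automated classifier applied to the generated images.
observed = Counter({"female": 4, "male": 96})     # e.g., 100 "CEO" images
bls_share = {"female": 0.3, "male": 0.7}          # illustrative BLS figures
print(expand_prompt("CEO", "female"))             # a portrait photo of a female CEO
print(representation_gap(observed, bls_share))
```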

Key Findings on Bias Representation

  1. Occupations: The paper highlights pronounced discrepancies in gender representation, with women significantly under-represented in neutral prompts for occupations such as CEO and computer programmer in DALLE-v2 outputs. Conversely, roles such as nurse or housekeeper feature predominantly female figures, particularly in Stable Diffusion outputs. When prompts specify gender, race, or age, representational biases diminish, but new disparities in image quality may emerge. The models also lean toward racial stereotypes, overrepresenting white individuals while largely omitting others.
  2. Persons and Personality Traits: Both models exhibit persistent gender and racial biases. DALLE-v2 predominantly generates images of younger men for neutral "person" prompts, while Stable Diffusion tilts toward female representation, though predominantly of white individuals. For personality trait prompts, the paper reveals strong stereotypical associations, with men linked heavily to competence-related traits and women to warmth-related traits.
  3. Geographical Representation: The analysis extends to visual representations of everyday situations across diverse geographies. Findings demonstrate a skew towards countries like the United States and Germany in default prompts, whereas countries such as Nigeria and Ethiopia are far less represented. This bias risks propagating distorted cultural imagery and economic perceptions; one way such visual similarity can be measured is sketched below.
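A plausible way to quantify the geographic skew, consistent with the comparison the paper describes, is to embed images generated from a location-neutral prompt and from location-specific prompts in a shared visual feature space and measure average similarity. The sketch below uses CLIP via Hugging Face's `transformers` library; the model choice and prompt wording are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images: list[Image.Image]) -> torch.Tensor:
    """Return L2-normalized CLIP embeddings for a batch of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def mean_similarity(default_imgs, location_imgs) -> float:
    """Average cosine similarity between location-neutral images
    (e.g., "a photo of a wedding") and location-specific images
    (e.g., "a photo of a wedding in Nigeria")."""
    a, b = embed(default_imgs), embed(location_imgs)
    return (a @ b.T).mean().item()

# Hypothetical usage: a higher score for Germany than for Ethiopia would
# indicate that default outputs are visually closer to German imagery.
# score_de = mean_similarity(wedding_default_images, wedding_germany_images)
# score_et = mean_similarity(wedding_default_images, wedding_ethiopia_images)
```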

Implications and Future Directions

The results underscore the critical need for continuous scrutiny of T2I models, with a focus on dataset curation and training procedures that foster equitable representation. While expanded prompts can diversify outputs, they are not a panacea: they can degrade or destabilize image quality, and they do not address biases that surface in other usages of the model or its learned representations.

The paper advocates for complementary mitigation strategies, such as prompt expansion and post-generation output filtering, to cultivate representational fairness and shape how AI models navigate societal contexts. By advancing methodological rigor in bias evaluation, it lays the groundwork for future research to further dissect and rectify the underlying disparities in AI-generated imagery. As T2I capabilities advance, such informed approaches will be indispensable for responsible deployment that reflects an inclusive and balanced visual narrative.
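As a rough illustration of post-generation output filtering, the sketch below re-ranks a candidate pool so that each demographic group is represented before any group repeats. The `predicted_group` and `quality` fields are an assumed interface (labels would come from a downstream classifier or human annotation); this is a minimal sketch of the general technique, not the paper's implementation.

```python
from collections import defaultdict, deque

def diversity_rerank(candidates: list[dict], k: int) -> list[dict]:
    """Round-robin re-ranking over predicted demographic groups.

    candidates: dicts like {"image": ..., "predicted_group": str,
    "quality": float}. Returns up to k images, cycling through groups
    in quality order so that no single group dominates the output.
    """
    buckets = defaultdict(deque)
    # Best-quality candidates first, bucketed by predicted group.
    for c in sorted(candidates, key=lambda c: -c["quality"]):
        buckets[c["predicted_group"]].append(c)

    selected = []
    while len(selected) < k and buckets:
        for group in list(buckets):
            if buckets[group]:
                selected.append(buckets[group].popleft())
                if len(selected) == k:
                    break
            else:
                del buckets[group]  # group exhausted, drop it
    return selected
```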

Authors (2)
  1. Ranjita Naik (8 papers)
  2. Besmira Nushi (38 papers)
Citations (81)