Social perception of faces in a vision-language model (2408.14435v1)

Published 26 Aug 2024 in cs.CV, cs.AI, cs.CY, and cs.LG

Abstract: We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP's social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age, and lighting as much as age. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.

Authors (4)
  1. Carina I. Hausladen (5 papers)
  2. Manuel Knott (7 papers)
  3. Colin F. Camerer (2 papers)
  4. Pietro Perona (78 papers)
Citations (2)

Summary

Social Perception of Faces in a Vision-Language Model

The paper "Social perception of faces in a vision-LLM" by Hausladen, Knott, Camerer, and Perona explores the capability of the CLIP (Contrastive Language-Image Pretraining) model to make social judgments of human faces. Their approach involves comparing the similarity in CLIP embeddings between different textual prompts and synthetic face images that are systematically varied along specific dimensions. This design mitigates confounding variables often found in real-world data, offering a clearer examination of biases related to protected attributes such as age, gender, and race.

Key Findings

  1. Human-Like Social Judgments by CLIP: Despite the broad diversity of images and texts in CLIP's training set, the model can make nuanced, human-like social judgments on face images. This finding is significant because it extends CLIP's demonstrated capabilities from broad image-text matching to fine-grained social perception.
  2. Impact of Protected Attributes: The paper reveals that age, gender, and race systematically affect CLIP's social perception of faces, indicating inherent biases. In particular, the authors find pronounced disparities for Black women's faces, with CLIP producing extreme values of social perception across different ages and facial expressions.
  3. Role of Non-Protected Attributes: Non-protected attributes such as facial expression and lighting significantly influence social judgments. For instance, variations in facial expression, such as smiling, have a larger impact on social perception than age, while lighting affects perception almost as much as age. This underscores the need to control for non-protected visual attributes to avoid confounded results in bias studies.

Methodology

The authors employ a novel method grounded in social psychology and leverage synthetic face images where attributes are independently and systematically manipulated. This experimental paradigm not only controls for confounding factors present in wild-collected data but also enables causal inference regarding the effect of specific attributes on social perception.

Experimental Setup

  • Textual Prompts: Prompts are constructed from validated terms in social psychology, representing dimensions such as Warmth and Competence (stereotype content model) and Agency, Beliefs, and Communion (ABC model).
  • Synthetic Face Dataset: The synthetic dataset, dubbed CausalFace, systematically varies faces along six dimensions: race, gender, age, facial expression, lighting, and pose. These variations allow precise control and clean isolation of each attribute's effect (see the sketch below).
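Under stated assumptions, the design can be sketched as bipolar prompt pairs crossed with an independent factorial grid over the six attributes; the term pairs and attribute levels below are illustrative placeholders, not the paper's full lists:

```python
# Sketch of the experimental design: antonym prompt pairs in the style of
# the SCM/ABC literatures, crossed with a CausalFace-style factorial grid.
# Term pairs and attribute levels are illustrative, not the paper's lists.
from itertools import product

# Bipolar social-perception scales (one validustrated pair per dimension).
scales = {
    "warmth":     ("warm", "cold"),
    "competence": ("competent", "incompetent"),
    "agency":     ("powerful", "powerless"),
}
prompts = {dim: [f"a photo of a {adj} person" for adj in pair]
           for dim, pair in scales.items()}

# Every attribute varied independently of the others.
races       = ["Asian", "Black", "White"]        # subset for illustration
genders     = ["female", "male"]
ages        = [20, 40, 60]
expressions = ["neutral", "smiling"]
lightings   = ["dim", "bright"]
poses       = ["frontal", "three-quarter"]

grid = list(product(races, genders, ages, expressions, lightings, poses))
print(f"{len(grid)} attribute combinations")     # 3*2*3*2*2*2 = 144
```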

Results and Implications

Statistical Similarities and Variations

  • Similarity to Real Data: Bias metrics indicate that CausalFace closely mirrors real-world datasets (FairFace and UTKFace) in bias measurement, validating its use for this study.
  • Effect Comparisons: Comparing the variation caused by protected versus non-protected attributes, the paper finds that non-protected attributes such as facial expression and lighting can influence social perception as much as, or more than, protected attributes. This finding is crucial: it shows that bias studies must account for the full set of visual attributes to understand and mitigate bias. A simple version of such a comparison is sketched below.
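One simple way to make such a comparison concrete is to measure, for each attribute, the spread of mean similarity scores across its levels. The sketch below assumes a hypothetical table causalface_scores.csv of precomputed image-prompt scores; this statistic is illustrative and not necessarily the paper's exact effect measure:

```python
# Illustrative effect comparison: for each attribute, the range of mean
# similarity scores across its levels. Larger spread = larger influence.
import pandas as pd

# One row per (image, prompt) score, with the image's attribute labels.
# Assumed columns: age, gender, race, expression, lighting, pose, score.
df = pd.read_csv("causalface_scores.csv")  # hypothetical precomputed scores

attributes = ["age", "gender", "race", "expression", "lighting", "pose"]
effects = {
    attr: df.groupby(attr)["score"].mean().pipe(lambda m: m.max() - m.min())
    for attr in attributes
}
for attr, spread in sorted(effects.items(), key=lambda kv: -kv[1]):
    print(f"{attr:11s} range of group means: {spread:.4f}")
```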

Detailed Observations

  1. Intersectional Analysis: The paper provides an intersectional analysis, revealing that CLIP's perception differs markedly across demographic groups, with the most pronounced effects on Black women's faces. Age and smiling lead to significant shifts in social perception within these groups (a sketch of such a read-out follows this list).
  2. Bias Patterns: CLIP demonstrates biased tendencies across demographic groups, with nuanced responses to age and facial expression. This suggests that biases in vision-language models are multifaceted and not limited to a few observable socio-demographic categories.
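A minimal sketch of such an intersectional read-out, reusing the hypothetical scores table from the previous sketch; the pivot and extreme-cell check are illustrative, not the authors' analysis:

```python
# Mean social-perception score per race x gender cell, flagging the most
# extreme cells. Reuses the hypothetical causalface_scores.csv table.
import pandas as pd

df = pd.read_csv("causalface_scores.csv")
table = df.pivot_table(values="score", index="race", columns="gender",
                       aggfunc="mean")
print(table.round(3))

# The paper reports extreme values concentrated on Black women's faces;
# a check like this surfaces which intersectional cell is most extreme.
flat = table.stack()
print("highest:", flat.idxmax(), "lowest:", flat.idxmin())
```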

Conclusions

The paper's findings have significant implications for both the practical deployment and the theoretical understanding of vision-language models. Practically, the research underscores the importance of controlling for both protected and non-protected attributes to accurately assess and mitigate biases in AI systems. Theoretically, it offers a robust experimental framework for studying social biases in any vision-language model, extending beyond observational methods that are susceptible to confounding variables.

Future Directions

Future research can build on this work by generating richer synthetic datasets that explore finer intersections of attributes and by addressing potential brightness variations for a more comprehensive analysis. Additionally, comparing different vision-language models could illuminate how training data and architecture influence social judgments, driving forward the responsible use of AI in socially sensitive applications.
