An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained LLMs
This paper presents a thorough investigation into representational harms within pre-trained large language models (PTLMs), focusing on the biases these models may harbor against marginalized groups. Given the pervasive adoption of PTLMs in natural language processing tasks, it is crucial to understand and mitigate any societal biases they might perpetuate. The paper introduces a novel metric that quantifies implicit representational harms toward 13 marginalized demographic groups, and uses it to conduct an empirical analysis of 24 well-known PTLMs.
Key Contributions
The authors make two primary contributions. First, they offer a clear conceptualization of representational harms toward marginalized groups and introduce a metric to quantify these harms within PTLMs. The measurement model follows methodologies from the social sciences, adopting a two-stage approach: conceptualization and operationalization. Conceptualization defines the target demographics and the representational harms of interest, while operationalization assesses these harms through a language-modeling-based likelihood comparison of harmful versus benign statements.
Second, the paper presents an empirical evaluation of representational harms in PTLMs, analyzing how architectural factors such as depth and width influence these biases. Notably, the paper finds that prioritizing network depth over width can sometimes mitigate these harms.
Methodology
The metric centers on language-modeling objectives, measured via perplexity (or pseudo-perplexity for autoencoding models), gauging the likelihood each model assigns to implicitly harmful versus benign statements. The evaluation dataset is a subset of the ToxiGen dataset, annotated to distinguish harmful from benign content across 13 marginalized groups.
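As a rough illustration (not the authors' code), the likelihood comparison underlying the metric can be sketched as follows; the checkpoint names and scoring details are assumptions made for the sake of the example:

```python
# Minimal sketch of perplexity (autoregressive models) and pseudo-perplexity
# (autoencoding / masked models) for scoring individual statements.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM

def perplexity(model, tokenizer, sentence):
    """Standard perplexity: exp of the mean negative log-likelihood
    assigned by an autoregressive model to the sentence's tokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def pseudo_perplexity(model, tokenizer, sentence):
    """Pseudo-perplexity: mask each token in turn, score it with the
    masked model, and exponentiate the mean negative log-likelihood."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nlls = []
    for i in range(1, len(input_ids) - 1):  # skip special tokens
        masked = input_ids.clone().unsqueeze(0)
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Usage with illustrative checkpoints (any causal or masked PTLM would do):
# gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
# tok = AutoTokenizer.from_pretrained("gpt2")
# perplexity(gpt2, tok, "An example benign statement about a group.")
```

A lower perplexity means the model considers the statement more likely, so comparing perplexities of paired harmful and benign statements gives a per-sentence signal of implicit preference.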
An intriguing methodological choice is the Mann-Whitney U-test, which quantifies the likelihood disparity between the two sets of statements and yields a 'safety score'. Higher scores indicate that a model assigns greater likelihood to benign sentences than to harmful ones targeting the same group.
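A minimal sketch of how such a score could be derived is shown below; normalizing the U statistic to the [0, 1] range is the standard common-language effect size and is assumed here rather than taken verbatim from the paper:

```python
# Hedged sketch of a Mann-Whitney-U-based safety score over per-sentence
# perplexities for one demographic group.
from scipy.stats import mannwhitneyu

def safety_score(harmful_ppls, benign_ppls):
    """Fraction of (harmful, benign) sentence pairs in which the benign
    sentence receives the higher likelihood (i.e., the lower perplexity).
    Values near 1.0 mean the model rarely prefers harmful statements."""
    u_stat, p_value = mannwhitneyu(harmful_ppls, benign_ppls,
                                   alternative="greater")
    return u_stat / (len(harmful_ppls) * len(benign_ppls)), p_value

# Illustrative (made-up) perplexities for one group:
score, p = safety_score(
    harmful_ppls=[18.4, 25.0, 22.7, 30.2],
    benign_ppls=[12.3, 15.1, 10.8, 14.6],
)
```

Because the U-test is rank-based, the score depends only on how the two sets of likelihoods are ordered, not on their absolute scale, which makes it comparable across models of different sizes.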
Results and Implications
The safety scores reveal that PTLMs are prone to considerable representational harms, and that these harms affect marginalized demographics unevenly. Variations in safety scores across models also suggest that PTLMs' internal architectures significantly influence their bias levels; specifically, deeper models tend to exhibit less representational harm than wider ones.
The implications of this research are multifaceted. Practically, the findings highlight the need for careful architectural considerations in model development and suggest that future architectural innovations should consider bias mitigation as a core component. Theoretically, the work reinforces the necessity of diverse, interdisciplinary approaches to understanding and mitigating biases in AI systems. This includes integrating insights from social sciences to refine metrics and model fairness.
The results demonstrate that the intrinsic and extrinsic metrics currently used for bias assessment capture different aspects of representational harms and help surface previously unrecognized biases, indicating a gap in existing evaluation frameworks that this paper begins to bridge. The authors advocate for expanded bias evaluation metrics and datasets, pushing for systematic evaluations that align technical improvements with ethical AI commitments.
Future Directions
Future research could expand to intersectional demographics and investigate bias dynamics within combined marginalized groups, such as Middle Eastern women, to provide a more holistic evaluation. Furthermore, the potential use of the safety score as an objective function when training PTLMs presents an intriguing avenue for developing more equitable AI models.
Overall, this paper contributes significantly to the ongoing discourse on social biases in AI, advocating for more comprehensive strategies to ensure fairness and equity in LLMs’ development and deployment.