Reducing Sentiment Bias in LLMs via Counterfactual Evaluation
In this paper, the authors investigate the presence and mitigation of sentiment bias in large-scale language models (LLMs). The work focuses on how the sentiment of text generated by an LLM can shift with sensitive attributes such as country names, occupations, and genders. The motivation is that LLMs can unintentionally internalize social biases present in their training data, producing sentiment that is systematically conditioned on these demographic variables.
Methodology
To measure the bias, the authors employ a counterfactual evaluation framework that analyzes how sensitive attributes influence sentiment. Counterfactual evaluation systematically swaps the value of a sensitive attribute within an input context and assesses the impact on the sentiment of the generated text. For quantification, the researchers adapt fairness metrics from the machine learning fairness literature, focusing on individual fairness and group fairness.
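As an illustration of the counterfactual setup, the sketch below builds one prompt per attribute value from a fixed template; the template string, attribute list, and function name are hypothetical stand-ins rather than the authors' exact prompts.

```python
# Minimal sketch of counterfactual input construction (illustrative only).
# The template and attribute values are assumptions, not the paper's exact prompts.

TEMPLATE = "My friend is a/an {attribute}, and we "      # hypothetical prompt template
OCCUPATIONS = ["doctor", "nurse", "engineer", "waiter"]  # hypothetical attribute values

def counterfactual_prompts(template: str, attributes: list[str]) -> dict[str, str]:
    """Return one prompt per attribute value; each pair of prompts differs
    only in the sensitive attribute token."""
    return {attr: template.format(attribute=attr) for attr in attributes}

prompts = counterfactual_prompts(TEMPLATE, OCCUPATIONS)
# Each prompt is then used to sample continuations from the language model,
# and the sentiment of those continuations is compared across attribute values.
```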
In their counterfactual evaluation, they examine distributional differences in sentiment using three sentiment classifiers: the Google Cloud sentiment API, a BERT model fine-tuned on the Stanford Sentiment Treebank, and an opinion-word-based classifier. Using multiple classifiers of varying sensitivity and scope reduces the risk that the bias measurement itself is skewed by the quirks of any single classifier.
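For intuition, an opinion-word-based classifier can be as simple as a lexicon counter. The sketch below uses tiny made-up word lists as stand-ins for a real opinion lexicon and is not the classifier used in the paper.

```python
# Toy opinion-word-based sentiment scorer (illustrative only; the word sets here
# are stand-ins for a full opinion lexicon).

POSITIVE = {"good", "great", "kind", "successful", "happy"}
NEGATIVE = {"bad", "poor", "lazy", "terrible", "sad"}

def lexicon_sentiment(text: str) -> float:
    """Return a score in [-1, 1]: (#positive - #negative) / #matched opinion words."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched
```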
Proposed Solution
The authors propose two strategies to reduce sentiment bias via regularization of the LLM's latent representations:
- Embedding Regularization: Encourages the hidden states of the LLM to remain consistent across different sensitive attribute values. This regularization is aimed at making the LLM insensitive to perturbations in sensitive attribute tokens, thereby dampening unintended biases.
- Sentiment Regularization: Applies a sentiment classifier to the hidden states and penalizes differences in its predicted sentiment across varying sensitive attribute values, so that only the sentiment-related directions of the representation are constrained; a schematic sketch of both penalties follows this list.
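The sketch below renders the two penalties as extra loss terms on top of the usual language-modeling objective; the mean-pooling, choice of distances, and classifier head are illustrative assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def embedding_regularizer(h_orig: torch.Tensor, h_cf: torch.Tensor) -> torch.Tensor:
    """Penalize divergence between pooled hidden states of the original and
    counterfactual inputs. h_*: [batch, seq_len, hidden]. Mean-pooling and the
    squared L2 distance are illustrative choices."""
    return F.mse_loss(h_orig.mean(dim=1), h_cf.mean(dim=1))

def sentiment_regularizer(h_orig: torch.Tensor, h_cf: torch.Tensor,
                          sentiment_head: torch.nn.Module) -> torch.Tensor:
    """Penalize differences in a sentiment head's predictions on the two hidden
    states, so only sentiment-relevant information is constrained."""
    p_orig = torch.sigmoid(sentiment_head(h_orig.mean(dim=1)))
    p_cf = torch.sigmoid(sentiment_head(h_cf.mean(dim=1)))
    return (p_orig - p_cf).abs().mean()

# During the debiasing phase (lambda_reg is a tunable weight), the total loss
# would look roughly like:
#   loss = lm_loss + lambda_reg * sentiment_regularizer(h_orig, h_cf, sentiment_head)
```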
The proposed methods are integrated into a three-step curriculum: initial LLM training, training of a sentiment classifier on labeled sentiment data, and a final debiasing phase using the chosen regularizer.
Experimental Evaluation
Empirical evaluation is conducted on two corpora: a medium-scale Wikipedia dataset (WikiText-103) and a large-scale news dataset (WMT-19). Both are used to train Transformer-XL-based LLMs, to which the proposed debiasing methods are then applied. Evaluation metrics include perplexity, semantic similarity, and individual and group fairness scores computed from generated text samples.
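For concreteness, the fairness scores can be sketched as distances between the sentiment-score distributions of counterfactual subgroups. The Wasserstein-1 distance used below is one common choice and an assumption here, as are the function names.

```python
from itertools import combinations
import numpy as np
from scipy.stats import wasserstein_distance

def individual_fairness(scores_by_attr: dict[str, np.ndarray]) -> float:
    """Average Wasserstein-1 distance between the sentiment-score distributions
    of every pair of counterfactual attribute values (lower is fairer)."""
    pairs = list(combinations(scores_by_attr.values(), 2))
    return float(np.mean([wasserstein_distance(a, b) for a, b in pairs]))

def group_fairness(scores_by_attr: dict[str, np.ndarray]) -> float:
    """Average distance between each subgroup's sentiment distribution and the
    pooled distribution over all subgroups (lower is fairer)."""
    pooled = np.concatenate(list(scores_by_attr.values()))
    return float(np.mean([wasserstein_distance(s, pooled)
                          for s in scores_by_attr.values()]))

# scores_by_attr maps each attribute value (e.g. a country name) to the sentiment
# scores of continuations generated from its counterfactual prompt.
```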
Results indicate that both embedding and sentiment regularization reduce sentiment bias under the proposed fairness metrics. However, there is a trade-off between bias reduction and the semantic relevance of the generated text. Sentiment regularization, in particular, offers the better balance, achieving notable bias reduction while maintaining contextual relevance.
Implications
This research underscores the importance of fairness considerations in LLMs and offers computational methods to quantify and reduce bias. The findings matter for AI applications that build on LLMs, such as dialogue systems, content generation tools, and sentiment analysis pipelines, where biased outputs can have ethical and societal consequences.
Moving forward, the framework could be extended to forms of bias beyond sentiment and to other domains and applications. The paper is a concrete step toward more equitable AI systems, addressing bias proactively at the model level through targeted interventions.