- The paper introduces an adversarial debiasing framework that minimizes the influence of protected attributes while preserving prediction accuracy.
- It integrates a dual-component model: a predictor for the target outcome and an adversary that tries to recover the protected variables, so that training against the adversary suppresses their influence.
- Experiments on synthetic data, word embeddings, and the UCI Adult dataset show effective bias reduction with minimal performance trade-offs.
Mitigating Unwanted Biases With Adversarial Learning
The paper "Mitigating Unwanted Biases with Adversarial Learning" by Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell addresses the pervasive issue of bias in machine learning models. The authors propose an adversarial debiasing framework designed to mitigate biases related to protected variables such as gender and zip code while maintaining the accuracy of the predictions.
Framework Overview
The proposed framework integrates an adversarial component into the learning process. The model comprises two primary components:
- Predictor: This component predicts the target variable Y given the input X.
- Adversary: This component aims to predict the protected variable Z from the predictor's output $\hat{Y}$ (and, when enforcing Equality of Odds, also from the true label Y).
The training objective is to maximize the predictor's accuracy in predicting Y while minimizing the adversary's accuracy in predicting Z. This is achieved through a gradient-based method, where the predictor and adversary are trained in opposition to each other, similar to a Generative Adversarial Network (GAN) framework.
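As a minimal sketch of that update (our own function and variable names, assuming flattened weight gradients), the predictor's gradient can be combined with the adversary's as follows, removing the component that would help the adversary and additionally stepping against it:

```python
import numpy as np

def debiased_predictor_gradient(grad_pred, grad_adv, alpha=1.0, eps=1e-8):
    """Sketch of the predictor's modified gradient under adversarial debiasing.

    grad_pred : gradient of the predictor's loss w.r.t. its (flattened) weights.
    grad_adv  : gradient of the adversary's loss w.r.t. the same weights,
                obtained by back-propagating the adversary's loss through Y_hat.
    alpha     : trade-off hyperparameter controlling how strongly the adversary
                is fought (assumed name for the paper's trade-off term).
    """
    unit_adv = grad_adv / (np.linalg.norm(grad_adv) + eps)
    # Drop the part of the predictor's gradient that points along the adversary's
    # gradient (it would help the adversary), then push against the adversary.
    projection = np.dot(grad_pred, unit_adv) * unit_adv
    return grad_pred - projection - alpha * grad_adv

# The adversary is updated normally on its own loss in alternation with the
# predictor, analogous to the discriminator in a GAN.
```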
Fairness Measures
The authors discuss multiple fairness measures, including Demographic Parity, Equality of Odds, and Equality of Opportunity. The chosen fairness measure can be imposed as a constraint or folded into the loss function to ensure that the output predictions remain unbiased with respect to the protected variable Z (formal statements of each measure follow the list below).
- Demographic Parity: Ensures that the predictor's output is independent of the protected variable.
- Equality of Odds: Ensures that the predictor's output is conditionally independent of the protected variable given the true label Y.
- Equality of Opportunity: A relaxation of Equality of Odds that requires the condition only for a particular class Y = y, typically the positive outcome.
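Stated formally for a binary prediction $\hat{Y}$, true label $Y$, and binary protected variable $Z$, the three criteria are:

$$
\begin{aligned}
\text{Demographic Parity:}\quad & P(\hat{Y}=1 \mid Z=0) = P(\hat{Y}=1 \mid Z=1)\\
\text{Equality of Odds:}\quad & P(\hat{Y}=1 \mid Y=y,\ Z=0) = P(\hat{Y}=1 \mid Y=y,\ Z=1) \quad \text{for } y \in \{0, 1\}\\
\text{Equality of Opportunity:}\quad & P(\hat{Y}=1 \mid Y=1,\ Z=0) = P(\hat{Y}=1 \mid Y=1,\ Z=1)
\end{aligned}
$$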
Experiments and Results
Toy Scenario
The authors first validate their approach on a synthetic dataset in which Y is generated from X and the protected variable Z. Without debiasing, the predictor incorporates Z heavily into its predictions; with debiasing, the influence of Z is largely removed, demonstrating that the adversary mitigates the bias.
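A minimal illustration of this kind of setup is sketched below; it is not the paper's exact generative process, and the constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.integers(0, 2, size=n)                              # protected variable Z
x = rng.normal(size=n) + 2.0 * z                            # feature X correlated with Z
y = (x + 0.5 * z + rng.normal(size=n) > 1.0).astype(int)    # target Y depends on X and Z

# An unconstrained predictor trained on (x, z) will exploit z directly;
# the adversarial objective penalizes exactly that leakage of Z into the predictions.
```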
Word Embeddings
Word embeddings often reflect societal biases. The authors apply their method to word embeddings to reduce gender bias while maintaining predictive performance. They evaluate on the Google analogy dataset and show that the debiased embeddings still complete analogies accurately while avoiding gender stereotyping.
In the word analogy task, they use a direction-based approach to define a subspace representing gender (for example, from differences of gendered word pairs). By driving this gender component out of the learned representations during training, the model effectively reduces bias: completions for "he : she :: doctor : ?" shift from stereotyped answers such as "nurse" to less stereotypical ones.
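Geometrically, the core operation is a projection. A minimal sketch of removing a one-dimensional gender subspace from a set of embeddings is shown below (function name ours; this hard projection is a simplification of the paper's adversarial formulation, which instead learns to make the outputs carry no information along this direction):

```python
import numpy as np

def remove_gender_direction(vectors, direction, eps=1e-8):
    """Project embeddings onto the subspace orthogonal to a bias direction.

    vectors   : (n_words, dim) array of word embeddings.
    direction : (dim,) bias direction, e.g. the difference of gendered word
                vectors such as "he" - "she".
    """
    d = direction / (np.linalg.norm(direction) + eps)
    return vectors - np.outer(vectors @ d, d)
```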
UCI Adult Dataset
The adversarial debiasing approach is applied to the UCI Adult dataset, where the task is to predict income (> \$50K vs. ≤ \$50K) while reducing gender bias. The predictor is a logistic regression model, and the training objective enforces Equality of Odds between male and female subgroups.
The debiased model achieves near equality of odds, balancing the false positive rates (FPR) and false negative rates (FNR) across gender subgroups. The FPRs for females and males are $0.0647$ and $0.0701$ respectively, and the FNRs are $0.4458$ and $0.4349$, indicating reduced bias in the model's predictions.
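These per-group error rates can be computed directly from the model's hard predictions; a small sketch (function name ours) is:

```python
import numpy as np

def group_error_rates(y_true, y_pred, z):
    """False positive and false negative rates for each protected group.

    y_true, y_pred : binary arrays of true labels and hard predictions.
    z              : protected-group membership (e.g. 0 = female, 1 = male).
    """
    rates = {}
    for g in np.unique(z):
        yt, yp = y_true[z == g], y_pred[z == g]
        rates[g] = {
            "FPR": float(np.mean(yp[yt == 0] == 1)),  # negatives incorrectly flagged
            "FNR": float(np.mean(yp[yt == 1] == 0)),  # positives that were missed
        }
    return rates

# Equality of odds holds (approximately) when both FPR and FNR match across
# groups, which is what the reported female/male rates above indicate.
```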
Theoretical Guarantees
The paper provides theoretical guarantees under certain conditions:
- Convergence: If the predictor and adversary converge, and the adversary is expressive enough to model the optimal adversary, the method enforces the desired fairness constraint (Demographic Parity or Equality of Odds).
- Optimality: The predictor maintains performance on the target task while meeting the fairness constraint.
Conclusion and Future Work
The adversarial debiasing method proves to be a robust technique for mitigating biases in machine learning models. The experiments indicate that the approach is effective across domains, from synthetic data to word embeddings and the real-world UCI Adult dataset.
Future work might explore:
- Utility of Debiased Embeddings: Assessing the performance of debiased embeddings in various complex NLP tasks beyond analogies.
- Training Stability: Developing methods to stabilize the adversarial training process to ensure reliable convergence.
- Image Recognition: Extending the adversarial debiasing framework to image recognition tasks to address biases in visual data.
- Adversary Complexity: Investigating the need for complex adversaries for more sophisticated predictive models and continuous variables.
Overall, the paper makes significant strides in addressing bias in AI, presenting a generalizable and theoretically grounded approach crucial for the development of fair and accountable machine learning systems.