A Review of Fairness and A Practical Guide to Selecting Context-Appropriate Fairness Metrics in Machine Learning (2411.06624v3)

Published 10 Nov 2024 in cs.AI

Abstract: Recent regulatory proposals for artificial intelligence emphasize fairness requirements for machine learning models. However, precisely defining the appropriate measure of fairness is challenging due to philosophical, cultural and political contexts. Biases can infiltrate machine learning models in complex ways depending on the model's context, rendering a single common metric of fairness insufficient. This ambiguity highlights the need for criteria to guide the selection of context-aware measures, an issue of increasing importance given the proliferation of ever tighter regulatory requirements. To address this, we developed a flowchart to guide the selection of contextually appropriate fairness measures. Twelve criteria were used to formulate the flowchart. This included consideration of model assessment criteria, model selection criteria, and data bias. We also review fairness literature in the context of machine learning and link it to core regulatory instruments to assist policymakers, AI developers, researchers, and other stakeholders in appropriately addressing fairness concerns and complying with relevant regulatory requirements.

Summary

  • The paper provides a practical guide and flowchart to help practitioners select appropriate fairness metrics for machine learning models based on twelve context-specific criteria.
  • It reviews various types of bias in AI systems (data, algorithm, interaction) and mathematically defines common group fairness metrics like Equal Opportunity, Statistical Parity, and Equalized Odds, along with individual fairness metrics.
  • The authors highlight the complexity and context-dependence of defining fairness, review legal notions and available software toolkits, and discuss challenges like metric incompatibility and the fairness-accuracy trade-off.

This paper addresses the challenge of selecting context-appropriate fairness metrics for ML models, emphasizing the increasing importance of this issue due to growing regulatory requirements for AI. The authors note that defining fairness is complex, as it is influenced by philosophical, cultural, and political contexts, and that biases can infiltrate ML models in various ways depending on the model's context. The paper provides a flowchart to help guide the selection of appropriate fairness metrics, reviews relevant literature, and links it to core regulatory instruments to assist stakeholders in addressing fairness concerns and complying with regulations.

The authors begin by defining bias and fairness, noting that the terms are often used interchangeably, although they are not the same. Bias is defined as a systematic difference in the treatment of certain objects, people, or groups compared to others. In AI, some biases are essential for ML algorithms to function. However, unwanted biases can lead to unfair system outcomes, negatively affecting certain groups. In a social context, bias usually refers to AI system outcomes that cause injustice by unfairly discriminating against individuals or groups. Fairness is defined as a treatment, behavior, or outcome that respects established facts, beliefs, and norms and is not determined by favoritism or unjust discrimination. Because the notion of fairness is complex and context-dependent, no universal definition exists. The authors clarify that bias and fairness are related because fairness metrics can be used to identify unwanted biases in AI systems. However, whether a system is fair depends on more than just the presence of unwanted bias.

The paper then describes bias in AI systems, noting that ML systems can be susceptible to the same biases as humans because they require human decisions for their development and implementation. The authors discuss standardized bias identification methods to aid regulators in assessing model fairness. The paper also reviews different types of biases and the interactions between them, categorizing them both by temporal location (data bias, algorithm bias, and user interaction bias) and by where in the AI development process they occur.

The authors elaborate on various types of data biases, such as:

  • Measurement selection bias: This occurs when predictive features are chosen and measured in a way that leads to distortions or inaccuracies in results between protected and unprotected groups.
  • Omitted variable bias: This happens when important variables are excluded from the model.
  • Sampling and representation bias: This arises when the sampled population does not represent the target population or when the sample underrepresents part of the target population.
  • Missing data bias: This occurs when missing data in datasets is correlated with protected groups.
  • Aggregation bias: This happens when false conclusions about individuals are drawn from observations of group-level variables.

The paper also explains algorithm biases that can affect user behavior, including:

  • Algorithm bias: This occurs when a model minimizes average error, fitting the model to the most typical members of a population.
  • Evaluation bias: This happens when benchmark data used to assess a particular task does not match the target population.
  • Popularity bias: This arises when increased exposure makes an item more popular, which in turn further increases its visibility.
  • Decision bias: This occurs during the selection of design metrics while building an ML model.

The authors also describe how user interaction can lead to biases that re-enter the system:

  • Historical bias: This occurs when historically biased human decisions are used to generate data.
  • Temporal bias: This arises from variations in behavior over time.
  • Population bias: This happens when the sampled demographic does not represent the entire population's characteristics.
  • Confirmation bias: This is the unconscious favoring of data, processes, or interpretations of model output that confirm the researcher's preconceptions.

The authors discuss how the specific context of an AI implementation can affect the identification of biases, with the context including the model's requirements (input parameters, classification goals, and definitions of accuracy or fairness) and framing (consequences of use, how the model is used, and the source of training data). To address this, the paper proposes a new bias interaction loop that incorporates model context into the identification of biases.

The paper also reviews efforts to reduce bias in AI systems, noting that many fairness-enhancing methods rarely consider the causes of biases. The authors summarize studies on bias mitigation in AI models, including methods to address incomplete datasets, reduce algorithmic bias, and generate unbiased datasets.

The authors highlight that definitions of fairness differ due to philosophical, religious, cultural, social, historical, political, legal, and ethical factors. They also note that most fairness definitions have been developed by Western researchers, leading to a relatively Euro-centric perspective in fairness research. The paper reviews legal notions of fairness in the U.S. (disparate treatment and disparate impact) and in New Zealand, where the law prohibits unfair treatment on the basis of "irrelevant personal characteristics." The paper also notes the existence of several software toolkits designed to aid in the assessment of fairness in ML models, such as AIF360, FairLearn, TensorFlow Responsible AI, Aequitas, and Themis-ML.
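
To make the toolkit discussion concrete, the short sketch below shows how two of the reviewed group fairness measures could be computed with Fairlearn. It is an illustrative example rather than anything from the paper: it assumes a recent Fairlearn release in which demographic_parity_difference, equalized_odds_difference, and MetricFrame live in fairlearn.metrics, and the labels, predictions, and group memberships are made up.

```python
# Illustrative sketch (not from the paper): auditing a binary classifier with Fairlearn.
# Assumes a recent Fairlearn release; function locations may differ in older versions.
import numpy as np
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_difference,
    equalized_odds_difference,
)
from sklearn.metrics import recall_score

# Hypothetical labels, predictions, and a protected attribute (groups "A" and "B").
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
group  = np.array(["A", "A", "A", "B", "B", "B", "A", "B", "A", "B"])

# Statistical parity gap: difference in positive prediction rates between groups.
sp_gap = demographic_parity_difference(y_true, y_pred, sensitive_features=group)

# Equalized odds gap: worst-case difference in TPR/FPR between groups.
eo_gap = equalized_odds_difference(y_true, y_pred, sensitive_features=group)

# Per-group recall (equal opportunity compares these values across groups).
recall_by_group = MetricFrame(
    metrics=recall_score, y_true=y_true, y_pred=y_pred, sensitive_features=group
)

print(f"statistical parity difference: {sp_gap:.3f}")
print(f"equalized odds difference:     {eo_gap:.3f}")
print(recall_by_group.by_group)
```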

The paper presents a systematic review of recent literature to identify the most popular fairness definitions currently in use. The authors categorized the fairness notions into observational vs. causal measures and into individual or group definitions. Observational measures only consider the data, whereas causal measures consider how the data are generated. The review found that equal opportunity (EOP), statistical parity (SP), and equalized odds (EO) were the most commonly used metrics. The paper then provides mathematical definitions for various group fairness metrics with binary outputs (a small computational sketch follows the list), including:

  • Equal Opportunity (EOP): EOP = TPR = \frac{TP}{TP + FN}
    • TPR: True positive rate
    • TP: True positives
    • FN: False negatives
  • Statistical Parity (SP): SP = \frac{TP + FP}{TP + FP + TN + FN}
    • FP: False positives
    • TN: True negatives
  • Equalized Odds (EO): EO = \{FPR, FNR\} = \left\{\frac{FP}{TN + FP}, \frac{FN}{TP + FN}\right\}
    • FPR: False positive rate
    • FNR: False negative rate
  • Predictive Parity (PP): PP = \{PPV, NPV\} = \left\{\frac{TP}{TP + FP}, \frac{TN}{TN + FN}\right\}
    • PPV: Positive predictive value
    • NPV: Negative predictive value
  • Balanced Group Balanced Accuracy (BG-BACC): BG\text{-}BACC = \frac{1}{2}(Recall + Specificity)
    • Recall: \frac{TP}{TP + FN}
    • Specificity: \frac{TN}{TN + FP}
  • Balanced Group Accuracy (BG-ACC): BG\text{-}ACC = \frac{TN + TP}{TN + TP + FN + FP}
  • Equal Mis-Opportunity (EMO): EMO = \frac{FP}{TN + FP}
  • Average Odds (AO): AO = \frac{1}{2}\left(\frac{FP}{TN + FP} + \frac{TP}{TP + FN}\right)
  • Balanced Group F1 (BG-F1): F1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}
    • Precision: \frac{TP}{TP + FP}

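As a practical illustration (not from the paper), the following minimal NumPy sketch computes several of these confusion-matrix-based quantities separately for each protected group, using hypothetical labels, predictions, and group memberships; comparing the per-group values is what the corresponding fairness metrics formalize.

```python
# Minimal sketch: per-group confusion-matrix metrics with plain NumPy.
# Data are hypothetical; denominators are assumed nonzero for brevity.
import numpy as np

def group_metrics(y_true, y_pred, group, value):
    """Confusion-matrix metrics for the subset where group == value."""
    m = group == value
    yt, yp = y_true[m], y_pred[m]
    tp = np.sum((yt == 1) & (yp == 1))
    fp = np.sum((yt == 0) & (yp == 1))
    tn = np.sum((yt == 0) & (yp == 0))
    fn = np.sum((yt == 1) & (yp == 0))
    return {
        "EOP (TPR)": tp / (tp + fn),
        "SP (positive rate)": (tp + fp) / (tp + fp + tn + fn),
        "FPR": fp / (tn + fp),
        "FNR": fn / (tp + fn),
        "PPV": tp / (tp + fp),
        "BG-BACC": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group  = np.array(["A", "A", "A", "B", "B", "B", "A", "B"])

for g in ("A", "B"):
    print(g, group_metrics(y_true, y_pred, group, g))
```
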
Additionally, the paper provides mathematical definitions for group fairness metrics with regressive outputs (a brief computational sketch follows the list):

  • Balanced Group AUC (BG-AUC): AUC = \int TPR \, d(FPR)
  • Calibration (CAL): CAL = \frac{1}{M} \sum_{i=1}^{M} |Y_i - \overline{Y}(Y_{i-1} \ldots Y_i)|
    • M: number of bins
    • Y_i: prediction for individual i
    • \overline{Y}(Y_{i-1} \ldots Y_i): average true outcome for cases within the i-th bin
  • Balance (BAL): BAL = \left| \frac{1}{N_1} \sum_{Y=1} \overline{Y} - \frac{1}{N_0} \sum_{Y=0} \overline{Y} \right|
    • \overline{Y}: predicted probability
    • N_1: number of cases where the outcome Y = 1
    • N_0: number of cases where the outcome Y = 0

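A minimal sketch of these score-based quantities, again using hypothetical data and one common reading of the binned calibration formula (comparing the mean predicted score with the mean observed outcome within each bin):

```python
# Minimal sketch (not from the paper): score-based metrics in NumPy.
# y_score holds hypothetical predicted probabilities; CAL uses equal-width bins.
import numpy as np

def calibration_error(y_true, y_score, n_bins=5):
    """Mean |mean predicted score - mean observed outcome| over probability bins (CAL)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    errs = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        last = i == n_bins - 1
        m = (y_score >= lo) & ((y_score <= hi) if last else (y_score < hi))
        if m.any():
            errs.append(abs(y_score[m].mean() - y_true[m].mean()))
    return float(np.mean(errs))

def balance(y_true, y_score):
    """|mean score of true positives - mean score of true negatives| (BAL)."""
    return float(abs(y_score[y_true == 1].mean() - y_score[y_true == 0].mean()))

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1])

print("CAL:", calibration_error(y_true, y_score))
print("BAL:", balance(y_true, y_score))
```
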
Finally, the paper provides a mathematical definition for an individual fairness metric (a short sketch follows):

  • Fairness Through Awareness (FTA): FTA = KNNC = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{k} \sum_{j \in KNN(x_i)} |Y_i - Y_j|
    • N: number of data points
    • k: number of neighbors
    • Y_i: outcome for data point x_i
    • Y_j: outcomes of the k data points closest to x_i

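The k-NN consistency form of FTA can be sketched directly from the formula above; the example below assumes Euclidean distance as the similarity metric and uses hypothetical features and outcomes.

```python
# Sketch: individual fairness as k-NN consistency (FTA/KNNC), assuming Euclidean
# distance as the similarity metric; features, outcomes, and k are hypothetical.
import numpy as np

def knn_consistency(X, y, k=3):
    """1 - average absolute outcome difference between each point and its k nearest neighbors."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(dists, np.inf)                                 # exclude the point itself
    total = 0.0
    for i in range(n):
        nbrs = np.argsort(dists[i])[:k]                             # k nearest neighbors of x_i
        total += np.mean(np.abs(y[i] - y[nbrs]))
    return 1.0 - total / n

X = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.75], [0.5, 0.5], [0.12, 0.22]])
y = np.array([1, 1, 0, 0, 1, 0])

print("FTA (KNN consistency):", knn_consistency(X, y, k=2))
```
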
The authors discuss the challenges associated with measuring fairness in ML models, which include the difficulty of identifying marginalized groups, the constraints imposed by unequal base rates, the incompatibility between different fairness metrics, and the fairness-accuracy trade-off. The paper emphasizes the need for a framework for selecting the most appropriate fairness metric for a given context.
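
To make the incompatibility concrete, consider an illustrative calculation (not taken from the paper): suppose two groups have identical error rates TPR = 0.8 and FPR = 0.1, so equalized odds is satisfied, but their base rates are 0.5 and 0.2. Then PPV = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.1 \times 0.5} \approx 0.89 for the first group and PPV = \frac{0.8 \times 0.2}{0.8 \times 0.2 + 0.1 \times 0.8} \approx 0.67 for the second, so predictive parity cannot hold at the same time. With unequal base rates and an imperfect classifier, some fairness criteria must be traded off against others.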

To address the issue of selecting context-appropriate fairness measures, the paper introduces a flowchart built around twelve criteria. The twelve criteria are:

  1. Assessing data vs. assessing outcome
  2. Continuous prediction, classification, or generative modeling
  3. Biased data
  4. Distance metric available
  5. Equity requirements
  6. Classification models output binary or regressive values
  7. Threshold fixed or floating
  8. Base rates equal
  9. Emphasis on precision or recall
  10. Emphasis on false positives (FP) or false negatives (FN)
  11. Emphasis on the positive class or negative class
  12. Balanced dataset

The paper provides three examples of how the flowchart can be used to select appropriate fairness measures (a simplified, hypothetical selector sketch follows the examples):

  • Prisoner recidivism: The flowchart recommends using balanced group AUC, with a focus on recall.
  • CV evaluation: The flowchart recommends using statistical parity, since equity is required.
  • Spam filtering of political emails: The flowchart recommends using balanced group accuracy.
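
Purely as a loose illustration of criteria-driven selection, and not a reproduction of the paper's flowchart, a hypothetical selector covering only a handful of the twelve criteria might look like the following sketch; the branch order and defaults here are assumptions, not the paper's logic.

```python
# Hypothetical, heavily simplified sketch of criteria-driven metric selection.
# It encodes only a few of the twelve criteria and is NOT the paper's flowchart.
from dataclasses import dataclass

@dataclass
class Context:
    equity_required: bool    # criterion 5: equity requirements
    output_type: str         # criterion 6: "binary" or "regressive"
    equal_base_rates: bool   # criterion 8: base rates equal
    emphasis: str            # criteria 9-10: "recall", "precision", or "none"

def suggest_metric(ctx: Context) -> str:
    if ctx.equity_required:
        return "Statistical Parity (SP)"
    if ctx.output_type == "regressive":
        return "Balanced Group AUC (BG-AUC)"
    if ctx.emphasis == "recall":
        return "Equal Opportunity (EOP)"
    if ctx.emphasis == "precision":
        return "Predictive Parity (PP)"
    return "Balanced Group Accuracy (BG-ACC)" if ctx.equal_base_rates else "Equalized Odds (EO)"

# e.g. a CV-screening context where equity is required maps to Statistical Parity.
print(suggest_metric(Context(True, "binary", False, "none")))
```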

The authors acknowledge some caveats about the selection method, noting that porting fairness solutions designed for one context into another is a failure mode that researchers frequently encounter. They emphasize that all fairness metrics should be considered within their appropriate philosophical and cultural framework.

In conclusion, the paper provides a comprehensive review of fairness in ML, focusing on the selection of context-appropriate fairness metrics and highlighting the importance of addressing the issue of fairness in light of increasing regulatory pressures.