Assessing Algorithmic Fairness with Unobserved Protected Class Using Data Combination
The paper by Nathan Kallus, Xiaojie Mao, and Angela Zhou addresses a significant challenge in assessing algorithmic fairness: protected class membership, such as race or gender, is often unobserved in the datasets being audited. The issue is prevalent in critical domains like lending and healthcare, where fairness evaluation is paramount but direct observation of protected classes is often missing due to legal or practical constraints. The authors study what can be learned about algorithmic fairness by combining the primary dataset with an auxiliary dataset, such as US census data, that links proxy variables like surname and geolocation to the protected class.
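To make the proxy construction concrete, here is a minimal sketch (hypothetical numbers and a hypothetical helper, not the paper's data or code) of one common recipe for such probabilities: combining a surname-based race distribution with a geolocation-based one via Bayes' rule, in the spirit of BISG-style methods and under a conditional-independence assumption.

```python
# A minimal sketch of forming proxy probabilities for an unobserved protected
# class (illustrative numbers; not the paper's implementation).
# Assumption: surname and census tract are conditionally independent given race.
import numpy as np

def proxy_probabilities(p_race_given_surname, p_race_given_tract, p_race):
    """Posterior P(race | surname, tract) via Bayes' rule:
    P(r | s, g) is proportional to P(r | s) * P(r | g) / P(r)."""
    unnormalized = p_race_given_surname * p_race_given_tract / p_race
    return unnormalized / unnormalized.sum()

# Hypothetical example with three race categories.
p_race = np.array([0.6, 0.3, 0.1])                # overall population shares
p_race_given_surname = np.array([0.2, 0.7, 0.1])  # from a surname table
p_race_given_tract = np.array([0.5, 0.4, 0.1])    # from a census-tract table
print(proxy_probabilities(p_race_given_surname, p_race_given_tract, p_race))
```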
Key Contributions
- Problem Formulation:
- The authors formulate the assessment of algorithmic fairness with unobserved protected classes as a data combination problem involving two datasets: a primary dataset that records decisions or outcomes and proxy variables but no protected class labels, and an auxiliary dataset that records proxy variables and protected class labels but no decisions or outcomes. This split is what creates the identification problem at the heart of the paper (a minimal formalization is sketched after this list).
- Identification Conditions:
- The paper analyzes the conditions under which fairness metrics, like disparate impact, can be identified from the available data. Without strong assumptions or highly informative proxies, these metrics are shown not to be point-identified; only bounds on them can be learned (see the per-cell bounds sketched after this list).
- Characterizing Partial Identification Sets:
- Using optimization-based characterizations, the authors derive the sharp partial identification sets for several disparity measures, that is, the tightest range of disparity values consistent with the observed data (a simplified linear-programming illustration follows the list).
- Methodology for Estimation and Inference:
- The paper develops statistical methods to estimate the partial identification sets and to conduct inference that accounts for sampling uncertainty, so that reported disparity bounds reflect both statistical uncertainty and the ambiguity left by the unobserved protected class (a rough resampling illustration follows the list).
- Empirical Applications:
- Real-world case studies in mortgage lending and personalized medicine demonstrate the approach's applicability, showcasing how one can use the presented methods to obtain robust fairness assessments in practical scenarios.
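As a minimal formalization of the data-combination setup referenced above (our notation, simplified to a binary decision Ŷ, a binary protected class A, and a discrete proxy Z): the primary data identify the decision rate within each proxy cell, the auxiliary data identify the protected class composition of each cell, but the disparity of interest depends on a joint distribution of (Ŷ, A) that appears in neither dataset.

```latex
% Identified from the data:
%   primary dataset:    P(\hat{Y} = 1 \mid Z = z)
%   auxiliary dataset:  P(A = a \mid Z = z), \quad P(Z = z)
% Target, e.g. demographic disparity, depends on the unobserved joint of (\hat{Y}, A):
\delta = P(\hat{Y} = 1 \mid A = 1) - P(\hat{Y} = 1 \mid A = 0)
       = \frac{\sum_z P(\hat{Y} = 1, A = 1 \mid Z = z)\, P(Z = z)}{P(A = 1)}
       - \frac{\sum_z P(\hat{Y} = 1, A = 0 \mid Z = z)\, P(Z = z)}{P(A = 0)},
% where P(A = a) = \sum_z P(A = a \mid Z = z)\, P(Z = z) is itself identified.
```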
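Within each proxy cell, the unobserved joint is constrained only by its two identified marginals, which yields Fréchet-style bounds rather than a unique value; the bounds collapse to a point only when a cell is essentially homogeneous in Ŷ or A, i.e., when the proxies are highly informative.

```latex
\max\{0,\; P(\hat{Y} = 1 \mid Z = z) + P(A = a \mid Z = z) - 1\}
\;\le\; P(\hat{Y} = 1,\, A = a \mid Z = z) \;\le\;
\min\{P(\hat{Y} = 1 \mid Z = z),\; P(A = a \mid Z = z)\}
```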
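To see how optimization turns these cell-level constraints into bounds on a disparity measure, the sketch below minimizes and maximizes the demographic disparity over all joints consistent with the identified marginals; in this simplified binary/discrete setting that is a linear program. This is an illustration of the idea with hypothetical marginals, not the authors' estimator.

```python
# Simplified illustration (not the authors' code): with binary Yhat and A and a
# discrete proxy Z, the data pin down only the marginals of (Yhat, A) within
# each proxy cell. Minimizing / maximizing the demographic disparity over all
# joints consistent with those marginals is a linear program.
import numpy as np
from scipy.optimize import linprog


def disparity_bounds(p_z, p_yhat1_given_z, p_a1_given_z):
    """Bounds on P(Yhat=1 | A=1) - P(Yhat=1 | A=0) given per-cell marginals."""
    K = len(p_z)
    p_a1 = float(p_z @ p_a1_given_z)          # P(A=1) is identified
    p_a0 = 1.0 - p_a1

    # Unknowns: x[z, y, a] = P(Yhat=y, A=a | Z=z), flattened to length 4K.
    def idx(z, y, a):
        return 4 * z + 2 * y + a

    n = 4 * K
    c = np.zeros(n)
    for z in range(K):
        c[idx(z, 1, 1)] = p_z[z] / p_a1       # contributes to  P(Yhat=1 | A=1)
        c[idx(z, 1, 0)] = -p_z[z] / p_a0      # contributes to -P(Yhat=1 | A=0)

    # Within each cell, the joint must reproduce both identified marginals.
    A_eq, b_eq = [], []
    for z in range(K):
        for y in (0, 1):                      # sum_a x[z, y, a] = P(Yhat=y | Z=z)
            row = np.zeros(n)
            row[idx(z, y, 0)] = row[idx(z, y, 1)] = 1.0
            A_eq.append(row)
            b_eq.append(p_yhat1_given_z[z] if y else 1 - p_yhat1_given_z[z])
        for a in (0, 1):                      # sum_y x[z, y, a] = P(A=a | Z=z)
            row = np.zeros(n)
            row[idx(z, 0, a)] = row[idx(z, 1, a)] = 1.0
            A_eq.append(row)
            b_eq.append(p_a1_given_z[z] if a else 1 - p_a1_given_z[z])

    lo = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, 1)] * n)
    hi = linprog(-c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, 1)] * n)
    return lo.fun, -hi.fun


# Hypothetical marginals (in practice these would be estimated from the data).
print(disparity_bounds(np.array([0.5, 0.3, 0.2]),    # P(Z = z)
                       np.array([0.6, 0.4, 0.2]),    # P(Yhat = 1 | Z = z)
                       np.array([0.7, 0.5, 0.1])))   # P(A = 1 | Z = z)
```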
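The bounds above treat the cell probabilities as known, whereas in practice they are estimated. As a rough way to see how sampling variability widens the reported range, the snippet below resamples the cell probabilities from hypothetical counts and re-solves the program; this stands in for, and is much cruder than, the paper's formal inference procedure. It reuses the `disparity_bounds` function from the previous sketch.

```python
# Illustration only: propagate sampling uncertainty by resampling the cell
# probabilities from hypothetical counts and re-solving the bounds each time.
# Assumes disparity_bounds(...) from the previous sketch is already in scope.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts per proxy cell z (one row per cell).
primary_counts = np.array([[300, 200],    # columns: Yhat = 1, Yhat = 0
                           [120, 180],
                           [ 40, 160]])
auxiliary_counts = np.array([[350, 150],  # columns: A = 1, A = 0
                             [150, 150],
                             [ 20, 180]])

lowers, uppers = [], []
for _ in range(500):
    p_z = rng.dirichlet(primary_counts.sum(axis=1) + 1)
    p_yhat1 = rng.beta(primary_counts[:, 0] + 1, primary_counts[:, 1] + 1)
    p_a1 = rng.beta(auxiliary_counts[:, 0] + 1, auxiliary_counts[:, 1] + 1)
    lo, hi = disparity_bounds(p_z, p_yhat1, p_a1)
    lowers.append(lo)
    uppers.append(hi)

print("bounds, widened for sampling uncertainty:",
      (float(np.percentile(lowers, 2.5)), float(np.percentile(uppers, 97.5))))
```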
Practical Implications and Future Directions
This paper's methodologies serve as practical tools for industries where fairness is vital yet hard to quantify because protected class data are incomplete. Deploying these methods yields bounds on disparities, rather than point estimates, that remain credible when protected attributes are not directly observed.
The research opens pathways for future work to strengthen fairness assessments in machine learning models. Future directions may include refining algorithms for inferring protected classes, incorporating more informative proxy measures, or extending the methodologies to intersectional group analyses. This paper lays a foundation for tackling fairness in the absence of complete data and underscores the importance of transparent methodological frameworks.
In conclusion, the research by Kallus, Mao, and Zhou contributes significantly to the discourse on algorithmic fairness, proposing innovative solutions for challenges posed by unobserved protected classes. This work is indispensable for policymakers, practitioners, and researchers dedicated to advancing fairness in algorithmic decision-making without the complete demographic data that such assessments ideally require.