Counterfactually Fair Reinforcement Learning via Sequential Data Preprocessing (2501.06366v2)

Published 10 Jan 2025 in stat.ML, cs.CY, cs.LG, and stat.ME

Abstract: When applied in healthcare, reinforcement learning (RL) seeks to dynamically match the right interventions to subjects to maximize population benefit. However, the learned policy may disproportionately allocate efficacious actions to one subpopulation, creating or exacerbating disparities in other socioeconomically-disadvantaged subgroups. These biases tend to occur in multi-stage decision making and can be self-perpetuating, which if unaccounted for could cause serious unintended consequences that limit access to care or treatment benefit. Counterfactual fairness (CF) offers a promising statistical tool grounded in causal inference to formulate and study fairness. In this paper, we propose a general framework for fair sequential decision making. We theoretically characterize the optimal CF policy and prove its stationarity, which greatly simplifies the search for optimal CF policies by leveraging existing RL algorithms. The theory also motivates a sequential data preprocessing algorithm to achieve CF decision making under an additive noise assumption. We prove and then validate our policy learning approach in controlling unfairness and attaining optimal value through simulations. Analysis of a digital health dataset designed to reduce opioid misuse shows that our proposal greatly enhances fair access to counseling.

Summary

  • The paper extends counterfactual fairness to sequential decision-making, formulating a generalized definition for CMDPs that ensures invariant actions across counterfactual sensitive attributes.
  • It proposes a sequential data preprocessing algorithm to estimate counterfactual states and rewards, enabling the use of standard offline RL methods for fair policy learning.
  • Theoretical guarantees and empirical results on synthetic and real datasets demonstrate the method’s ability to balance fairness with cumulative reward performance.

This paper, "Counterfactually Fair Reinforcement Learning via Sequential Data Preprocessing" (2501.06366), addresses the critical issue of fairness in dynamic, sequential decision-making systems governed by Reinforcement Learning (RL). It highlights that standard RL methods, optimized for maximizing overall rewards, can inadvertently lead to discriminatory outcomes, particularly when sensitive attributes like race or gender are correlated with states or rewards. The paper focuses on Counterfactual Fairness (CF), an individual-level fairness notion rooted in causal inference, which requires that a decision for an individual would be the same had their sensitive attribute been different, while keeping everything else about them the same.

Applying CF in sequential decision-making, modeled as a Contextual Markov Decision Process (CMDP), is challenging. Sensitive attributes (Z) can influence states (S) and rewards (R) not just directly, but also indirectly over time through the causal structure. Standard RL policies might rely on these state variables, inheriting the bias embedded by the sensitive attribute. Furthermore, it is not immediately obvious if CF policies in dynamic settings can retain desirable properties like stationarity, which are crucial for efficient policy learning in RL.
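
A simple data-generating process consistent with this description, and with the additive noise assumption invoked later, is (schematically; the paper's exact structural model may differ)

$$S_{t+1} = f_S(S_t, A_t, Z) + \eta_{t+1}, \qquad R_t = f_R(S_t, A_t, Z) + e_t,$$

where the noise terms $\eta_{t+1}$ and $e_t$ are taken to be independent of $Z$. The possibility of recovering these noise terms from the factual trajectory is what makes the counterfactual states and rewards estimable.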

Motivated by a real-world digital health study aimed at reducing opioid misuse (the PowerED study), in which scarce counselor time must be allocated fairly across patient subgroups, the authors propose a novel framework for achieving CF in CMDPs. Their key contributions are:

  1. A Generalized CF Definition for CMDPs: They extend the single-stage CF definition to the multi-stage setting (Definition 2). For a decision rule at time t, CF requires that the action distribution remains invariant if the individual's sensitive attribute Z is changed counterfactually, given the observed history up to time t. This formulation is crucial because it fixes the past sequence of observed actions, making the definition independent of the behavior policy that generated the data.
  2. Theoretical Characterization of CF Policies: The paper theoretically shows that a policy satisfies CF if its decisions at time t depend only on the set of counterfactual states and rewards at time t across all possible values of the sensitive attribute (Theorem 1). Furthermore, they prove that under a stationary CMDP, the optimal CF policy within the class of history-dependent CF policies is actually stationary and depends only on the current set of counterfactual states (Theorem 2). This significantly reduces the search space for the optimal CF policy, allowing the use of standard stationary RL algorithms.
  3. A Sequential Data Preprocessing Algorithm: Since counterfactual states and rewards are not observed, they propose Algorithm 1 to estimate these quantities from observational data. The algorithm processes the data sequentially from t=1 to T. At each step t, it estimates the counterfactual state S_t and reward R_{t-1} for each possible value of the sensitive attribute Z', leveraging an additive noise assumption (Assumption 4): it first estimates the mean transition dynamics, then uses these estimates to reconstruct the "noise" component, which is assumed to be independent of Z. Adding this reconstructed noise to the estimated mean state/reward under the counterfactual Z' yields the counterfactual state/reward (a minimal code sketch of this step appears after the list). The preprocessed data consists of tuples $(\tilde{s}_{it}, a_{it}, \tilde{r}_{it})$, where $\tilde{s}_{it}$ is a vector containing the estimated counterfactual states for all Z' values and $\tilde{r}_{it}$ is a weighted average of the estimated counterfactual rewards across all Z' values.
  4. Policy Learning with Preprocessed Data: The preprocessed data $\{(\tilde{s}_{it}, a_{it}, \tilde{r}_{it})\}$ forms a standard MDP dataset in which the state space is augmented to include the counterfactual states. Any existing offline RL algorithm, such as Fitted Q-Iteration (FQI, Algorithm 2), can then be applied directly to this preprocessed dataset to learn a stationary CF policy $\hat{\pi}$ that takes $\tilde{s}_{it}$ as input (a generic FQI sketch also follows the list).
  5. Theoretical Guarantees: The paper provides theoretical guarantees on the performance of the policy learned using their preprocessing method and FQI.
    • Regret Bound (Theorem 3): The bound on the difference between the optimal value and the learned policy's value depends on the error in estimating counterfactual states/rewards (epsilon), the FQI approximation error, and the number of FQI iterations. This shows that the learned policy is near-optimal if the counterfactuals can be estimated accurately and FQI converges well.
    • Unfairness Control (Theorem 4): The bound on the CF metric (Equation 4), which measures the disparity in action distributions across counterfactual worlds, depends on the FQI error and the error in estimating the counterfactual states. This confirms that accurate estimation of counterfactuals and good FQI performance lead to lower unfairness.
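
To make the preprocessing step in contribution 3 concrete, the sketch below carries counterfactual states forward by one time step under the additive noise assumption: fit the mean transition, recover the noise from the factual transition, and add it back under each counterfactual value of the sensitive attribute. Function names, the choice of regressor, and the data layout are illustrative assumptions rather than the authors' implementation; counterfactual rewards would be handled analogously.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_mean_transition(s_t, a_t, z, s_next):
    """Estimate the mean transition E[S_{t+1} | S_t, A_t, Z].

    s_t, s_next: arrays of shape (n, d); a_t, z: arrays of shape (n,).
    The regressor is an arbitrary illustrative choice.
    """
    x = np.column_stack([s_t, a_t.reshape(-1, 1), z.reshape(-1, 1)])
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(x, s_next)
    return model


def counterfactual_step(model, cf_states, s_obs, s_next_obs, a_obs, z_obs, z_values):
    """One step of the sequential preprocessing under additive noise.

    cf_states maps each attribute value z' to the (n, d) array of estimated
    counterfactual states at time t; returns the analogous map for time t+1.
    """
    # Reconstruct the noise from the observed (factual) transition:
    #   eta_{t+1} = S_{t+1} - f_hat(S_t, A_t, Z)
    x_factual = np.column_stack([s_obs, a_obs.reshape(-1, 1), z_obs.reshape(-1, 1)])
    eta = s_next_obs - model.predict(x_factual).reshape(s_next_obs.shape)

    # Add the same noise back to the predicted mean under each counterfactual z'.
    cf_next = {}
    for z_prime in z_values:
        z_col = np.full((len(a_obs), 1), z_prime)
        x_cf = np.column_stack([cf_states[z_prime], a_obs.reshape(-1, 1), z_col])
        cf_next[z_prime] = model.predict(x_cf).reshape(s_next_obs.shape) + eta
    return cf_next
```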

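With the preprocessed tuples in hand, policy learning itself is standard offline RL. The following is a generic fitted Q-iteration loop over a discrete action set operating on the augmented state $\tilde{s}$ (the stacked counterfactual states); the regressor, discount factor, and feature encoding are illustrative assumptions, not a transcription of the paper's Algorithm 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fitted_q_iteration(s_tilde, a, r, s_tilde_next, actions, gamma=0.9, n_iter=50):
    """Generic FQI on preprocessed transitions (tilde_s, a, tilde_r, tilde_s')."""
    a = np.asarray(a)
    r = np.asarray(r, dtype=float)
    n = len(a)
    q_model = None
    for _ in range(n_iter):
        if q_model is None:
            targets = r.copy()  # first pass: Q_0 = 0, so the target is just the reward
        else:
            # Bellman target: tilde_r + gamma * max_a' Q(tilde_s', a')
            q_next = np.column_stack([
                q_model.predict(np.column_stack([s_tilde_next, np.full((n, 1), a_prime)]))
                for a_prime in actions
            ])
            targets = r + gamma * q_next.max(axis=1)
        q_model = RandomForestRegressor(n_estimators=100, random_state=0)
        q_model.fit(np.column_stack([s_tilde, a.reshape(-1, 1)]), targets)
    return q_model


def greedy_cf_policy(q_model, s_tilde_t, actions):
    """The learned CF policy: greedy in the augmented (counterfactual) state."""
    q_vals = np.column_stack([
        q_model.predict(np.column_stack([s_tilde_t, np.full((len(s_tilde_t), 1), a_)]))
        for a_ in actions
    ])
    return np.asarray(actions)[q_vals.argmax(axis=1)]
```
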
For practical implementation, the sequential data preprocessing (Algorithm 1) is performed once on the observational dataset, and the resulting preprocessed dataset is used to train the policy with an offline RL algorithm. During deployment, to make a decision at time t for an individual, their counterfactual states $\tilde{s}_t$ must be estimated sequentially: the observed history up to t, the previously estimated counterfactual states $\tilde{s}_{t-1}$, and the estimated transition function are combined to compute the current counterfactual state vector $\tilde{s}_t$, which the learned policy then takes as input to select an action. This means the estimated $\tilde{s}_{t-1}$ must be stored between decision points.
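
As a rough usage example of how the pieces above fit together at decision time, the loop below maintains the dictionary of counterfactual states online and feeds their concatenation to the learned Q-function. It reuses the assumed `counterfactual_step` and `greedy_cf_policy` helpers from the earlier sketches, and the stream of observed states and the baseline initialization are further simplifying assumptions.

```python
import numpy as np


def deploy_one_subject(transition_model, q_model, observed_states, z_obs, z_values, actions):
    """Illustrative online loop for a single individual.

    observed_states: list of (d,) arrays, the factual states seen over time.
    The counterfactual baseline states are initialized to the observed baseline
    state, which is an extra simplifying assumption.
    """
    cf_state = {z_prime: observed_states[0].reshape(1, -1) for z_prime in z_values}
    chosen = []
    for t in range(len(observed_states) - 1):
        # Augmented input tilde_s_t: counterfactual states stacked across z'.
        s_tilde_t = np.concatenate([cf_state[z_] for z_ in z_values], axis=1)
        a_t = greedy_cf_policy(q_model, s_tilde_t, actions)[0]
        chosen.append(a_t)

        # Carry the counterfactual states forward once the next factual state arrives.
        cf_state = counterfactual_step(
            transition_model, cf_state,
            s_obs=observed_states[t].reshape(1, -1),
            s_next_obs=observed_states[t + 1].reshape(1, -1),
            a_obs=np.array([a_t]),
            z_obs=np.array([z_obs]),
            z_values=z_values,
        )
    return chosen
```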

The paper evaluates the proposed approach through synthetic and semi-synthetic experiments mirroring the structure of the PowerED study. The authors compare their method against baselines including a "Full" policy (uses the sensitive attribute; high value but unfair), an "Unaware" policy (ignores the sensitive attribute; still potentially unfair), and an "Oracle" policy (uses the true counterfactuals; ideal). Results show that their approach significantly reduces counterfactual unfairness compared to the Full and Unaware policies, especially as the sensitive attribute's influence on states increases. While achieving lower unfairness, their method often yields slightly lower cumulative reward than the Full policy, illustrating a common fairness-value trade-off.

Finally, they apply the method to the real PowerED dataset, using education, age, sex, and ethnicity as sensitive attributes and weekly pain/interference scores as states. The results (Table 1) demonstrate that their sequential preprocessing method combined with offline RL achieves substantially lower counterfactual unfairness across all sensitive attributes compared to Full and Unaware policies, while maintaining a high level of cumulative reward, close to that of the Full policy. This suggests that the method can enhance fair access to interventions (like human counselors) in real-world settings without a significant loss in overall population benefit.

The paper acknowledges limitations, particularly the restrictiveness of the additive noise assumption for estimating counterfactuals, and suggests future work could explore more flexible causal inference techniques like VAEs or adversarial training to relax this. It also discusses the choice of CF definition and the trade-off between different policy classes in terms of fairness and value. Overall, the paper provides a theoretically grounded and empirically validated approach for incorporating counterfactual fairness into sequential decision-making systems using a practical data preprocessing strategy amenable to existing offline RL algorithms.
