Emergent Mind


This paper deals with optimal policy learning (OPL) with observational data, i.e. data-driven optimal decision-making, in multi-action (or multi-arm) settings, where a finite set of decision options is available. It is organized in three parts, where I discuss respectively: estimation, risk preference, and potential failures. The first part provides a brief review of the key approaches to estimating the reward (or value) function and optimal policy within this context of analysis. Here, I delineate the identification assumptions and statistical properties related to offline optimal policy learning estimators. In the second part, I delve into the analysis of decision risk. This analysis reveals that the optimal choice can be influenced by the decision maker's attitude towards risks, specifically in terms of the trade-off between reward conditional mean and conditional variance. Here, I present an application of the proposed model to real data, illustrating that the average regret of a policy with multi-valued treatment is contingent on the decision-maker's attitude towards risk. The third part of the paper discusses the limitations of optimal data-driven decision-making by highlighting conditions under which decision-making can falter. This aspect is linked to the failure of the two fundamental assumptions essential for identifying the optimal choice: (i) overlapping, and (ii) unconfoundedness. Some conclusions end the paper.
Comparison of actual versus optimal action allocation under a risk-neutral setting with offline learning.


  • This paper presents an in-depth study on optimal policy learning (OPL) in multi-action scenarios, focusing on estimation techniques, risk preference, and identifying potential failures.

  • Discusses three main estimation methods for optimal policy learning: Regression Adjustment (RA), Inverse Probability Weighting (IPW), and Doubly-Robust (DR), and their reliance on the assumptions of unconfoundedness and overlap.

  • Highlights the significance of incorporating decision-makers' risk tolerance into the policy learning process, showing how risk preference can alter the optimal policy choice.

  • Examines conditions such as weak overlap and violations of unconfoundedness that can lead to failures in optimal policy learning, offering strategies to mitigate these issues.


Decision-making over a finite set of alternatives arises in many domains and calls for principled techniques for optimal policy learning (OPL). This paper studies optimal policy learning with observational data in multi-action settings, covering estimation, the role of risk preference, and the conditions under which data-driven decision-making can fail.

Estimation of Optimal Policy and Reward Function

The paper begins by setting out the elements needed to estimate reward functions and optimal policies in multi-action scenarios. It defines a policy as a decision rule mapping environmental signals to actions, and the value function as the welfare or reward a policy achieves. It then states the two assumptions that underpin identification of the optimal choice, unconfoundedness and overlap, on which the statistical properties of offline optimal policy learning estimators depend.
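The definitions above can be written compactly as follows. This is a sketch in standard potential-outcomes notation; the symbols (X for the context, A for the observed action, Y(a) for the potential reward under action a, K for the number of actions) are assumptions chosen for illustration and may differ from the paper's own notation.

```latex
\begin{align*}
  % Policy: a decision rule mapping a context to one of K actions
  \pi &: \mathcal{X} \to \{0, 1, \dots, K-1\} \\
  % Value of a policy: expected reward when actions follow \pi
  V(\pi) &= \mathbb{E}\big[\, Y(\pi(X)) \,\big] \\
  % Optimal policy: any maximizer of the value
  \pi^{*} &\in \arg\max_{\pi} V(\pi) \\
  % Unconfoundedness: actions are as good as random given the context
  Y(a) &\perp\!\!\!\perp A \mid X \quad \text{for all } a \\
  % Overlap: every action has positive probability in every context
  \Pr(A = a \mid X = x) &> 0 \quad \text{for all } a \text{ and } x
\end{align*}
```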

Approaches to Estimation

Three principal methods for estimating the value function are expounded upon:

  1. Regression Adjustment (RA): This direct method leans on regression estimates of potential outcomes, its consistency hinging on the correct specification of the regression model.
  2. Inverse Probability Weighting (IPW): Utilizing the observed outcome directly, this method weights observations by the inverse of the propensity score, and becomes biased if the propensity score model is misspecified.
  3. Doubly-Robust (DR): This estimator amalgamates elements of RA and IPW methods, requiring only one of the two, either the propensity score or the conditional mean, to be correctly specified for consistency.
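
The three estimators listed above can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the paper's implementation: the data-generating process, the per-action linear regressions, the use of the true propensity scores in place of estimated ones, and all function names (`fit_ra`, `value_ra`, `value_ipw`, `value_dr`) are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 20_000, 3
idx = np.arange(n)

# Simulated observational data (every number here is an illustrative assumption).
X = rng.normal(size=(n, 2))
logits = 0.3 * X[:, [0]] * np.array([-1.0, 0.0, 1.0])            # shape (n, K)
e = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # true propensities
u = rng.random(n)
A = (u[:, None] > e.cumsum(axis=1)[:, :-1]).sum(axis=1)          # draw A ~ e(. | X)
true_mu = np.column_stack([
    1.0 + 0.5 * X[:, 0],
    1.2 - 0.5 * X[:, 1],
    0.8 + 0.2 * X[:, 0] + 0.6 * X[:, 1],
])
Y = true_mu[idx, A] + rng.normal(size=n)

def fit_ra(X, A, Y, n_actions):
    """Regression adjustment: one OLS fit per action, predicted for every unit."""
    Xd = np.column_stack([np.ones(len(X)), X])
    mu = np.empty((len(X), n_actions))
    for a in range(n_actions):
        beta, *_ = np.linalg.lstsq(Xd[A == a], Y[A == a], rcond=None)
        mu[:, a] = Xd @ beta
    return mu

def value_ra(mu, policy):
    """Average predicted reward of the action the policy picks."""
    return mu[np.arange(len(policy)), policy].mean()

def value_ipw(A, Y, e, policy):
    """Observed rewards, reweighted where the observed action matches the policy."""
    rows = np.arange(len(A))
    return np.mean((A == policy) / e[rows, policy] * Y)

def value_dr(A, Y, mu, e, policy):
    """RA prediction plus an IPW-weighted residual correction."""
    rows = np.arange(len(A))
    correction = (A == policy) / e[rows, policy] * (Y - mu[rows, A])
    return np.mean(mu[rows, policy] + correction)

mu_hat = fit_ra(X, A, Y, K)
policy = mu_hat.argmax(axis=1)            # plug-in policy: best predicted action
true_value = true_mu[idx, policy].mean()  # known only because the data are simulated

v_ra = value_ra(mu_hat, policy)
v_ipw = value_ipw(A, Y, e, policy)        # assumes known propensities
v_dr = value_dr(A, Y, mu_hat, e, policy)
print(f"true={true_value:.3f} RA={v_ra:.3f} IPW={v_ipw:.3f} DR={v_dr:.3f}")
```

Here both the outcome model and the propensity scores are correctly specified, so all three estimates should land near the true value. In applied work the propensity scores would have to be estimated, for example with a multinomial logit, and misspecifying either model is where the estimators start to differ.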

Decision Risk Analysis

The discussion then turns to decision risk, showing how the optimal choice can shift with the decision-maker's risk tolerance, that is, with how they trade off the conditional mean of the reward against its conditional variance. Ranking actions by expected reward alone is therefore incomplete: policy evaluation also has to account for the decision-maker's attitude towards risk.
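
A small numerical sketch shows how the preferred action can change with risk attitude. It assumes a linear mean-variance criterion, score = mean - risk_aversion * variance, which is one common way to encode the trade-off and is not necessarily the functional form used in the paper; the function name `risk_adjusted_choice` and the numbers are hypothetical.

```python
import numpy as np

def risk_adjusted_choice(mu, var, risk_aversion):
    """Pick, per unit, the action maximizing mean - risk_aversion * variance.

    mu, var: arrays of shape (n_units, n_actions) holding conditional means
    and conditional variances of the reward. risk_aversion = 0 is risk-neutral.
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    score = mu - risk_aversion * var
    return score.argmax(axis=-1)

# One unit, three actions: action 1 has the highest mean but also the highest variance.
mu = np.array([[1.0, 1.5, 1.2]])
var = np.array([[0.1, 2.0, 0.3]])

choices = {lam: int(risk_adjusted_choice(mu, var, lam)[0]) for lam in (0.0, 0.5, 2.0)}
print(choices)  # {0.0: 1, 0.5: 2, 2.0: 0}
```

A risk-neutral decision-maker picks the high-mean, high-variance action, while increasingly risk-averse ones move to safer actions, so the regret of any fixed policy depends on which criterion is used to judge it.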

Limitations of Optimal Data-Driven Decision-Making

A critical examination reveals conditions fostering failures in optimal policy detection:

  • Weak Overlap: When some actions are rarely or never observed in parts of the covariate space, conditional expectations of outcomes must be imputed by extrapolation, which can be erroneous and lead to sub-optimal decisions.

  • Weak Unconfoundedness: The presence of unobservable factors influencing both the decision and outcome variables undermines the assumption of conditional randomization, potentially skewing optimal action detection.
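
Weak overlap, unlike unconfoundedness, can be partly checked in the data. The following is a minimal diagnostic sketch, assuming a matrix of estimated propensity scores is already available; the function name `overlap_report` and the 0.05 threshold are hypothetical choices, not recommendations from the paper.

```python
import numpy as np

def overlap_report(propensity, threshold=0.05):
    """Flag units whose propensity for at least one action falls below threshold.

    propensity: array of shape (n_units, n_actions), rows summing to 1.
    Returns the share of low-propensity units per action and the indices of
    units with at least one low-propensity action.
    """
    p = np.asarray(propensity, dtype=float)
    below = p < threshold
    return {
        "share_below_by_action": below.mean(axis=0),
        "flagged_units": np.flatnonzero(below.any(axis=1)),
    }

# Illustrative propensities for four units and three actions.
e_hat = np.array([
    [0.50, 0.30, 0.20],
    [0.90, 0.09, 0.01],   # action 2 almost never taken for this unit
    [0.34, 0.33, 0.33],
    [0.02, 0.49, 0.49],   # action 0 almost never taken for this unit
])
report = overlap_report(e_hat, threshold=0.05)
print(report["share_below_by_action"])  # [0.25 0.   0.25]
print(report["flagged_units"])          # [1 3]
```

For flagged units, any estimate of the reward under the rarely observed action rests largely on model extrapolation, so recommendations for them deserve extra caution.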

Addressing Weak Overlap and Unconfoundedness

The paper outlines several strategies for cases where overlap is weak or unconfoundedness fails: collecting more contextual data, employing methods robust to selection on unobservables, conducting sensitivity analysis, and relying on prior knowledge and assumptions.

Practical Implications and Future Directions

The paper suggests that the choice of estimator (RA, IPW, or DR) and the incorporation of risk preferences shape how effective and applicable optimal policy learning frameworks are. It posits that future research could refine estimation techniques to accommodate varying degrees of risk tolerance and explore ways to overcome the limitations posed by weak overlap and failures of unconfoundedness.


By presenting the estimation techniques for optimal policy learning, examining the role of risk preferences, and identifying the pitfalls associated with observational data, this paper contributes to the field of data-driven decision-making. It invites further work on making policy learning methods more robust in multi-action scenarios.
