Offline Policy Evaluation

Updated 5 August 2025
  • Offline policy evaluation is a method to estimate a target policy’s long-run performance using historical data from different behavior policies without online testing.
  • It rigorously compares estimators like importance sampling and regression, quantifying risks such as variance due to data sparsity and model misspecification.
  • The framework extends to contextual bandits and MDPs, highlighting finite-sample limitations and the constrained benefits of semi-supervised improvements.

Offline policy evaluation (OPE) is the statistical task of estimating the expected performance (e.g., long-run average reward, expected cumulative reward) of a target policy using data collected from one or more different policies, without further online interaction. OPE is essential in domains such as healthcare, robotics, operations research, recommendation systems, and digital marketing, where deploying untested policies can incur significant risk, cost, or ethical constraints. OPE presents deep challenges, including limited coverage of the state–action space, distributional shift, finite-sample effects, model misspecification, and the possibility of hidden confounding.

1. Problem Formulation and Minimax Lower Bounds

Offline policy evaluation is typically formalized as follows. Data $D^n = \{(A_i, R_i)\}_{i=1}^n$ is collected under a known behavior policy $\pi_D$, with $A_i \sim \pi_D$ and $R_i$ drawn from an unknown reward distribution $\Phi(\cdot \mid A_i)$. The target policy $\pi$ is known, and the goal is to estimate its value $v^\pi_\Phi = \mathbb{E}_{A \sim \pi,\, R \sim \Phi(\cdot \mid A)}[R]$ as accurately as possible.
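
As a concrete illustration of this setup, the following sketch builds a small hypothetical bandit instance and logs data under a known behavior policy. The action count, policies, reward means, and noise level are all illustrative choices, not values from (Li et al., 2014).

```python
import numpy as np

# Hypothetical toy instance: a K-armed bandit logged under a uniform behavior
# policy, with a target policy concentrated on one arm. All values illustrative.
rng = np.random.default_rng(0)

K = 10
pi_D = np.full(K, 1.0 / K)                 # behavior policy pi_D(a)
pi = np.full(K, 0.1 / (K - 1))             # target policy pi(a) ...
pi[0] = 0.9                                # ... which mostly plays arm 0
mu = rng.uniform(0.0, 1.0, size=K)         # unknown mean rewards mu(a)
sigma = 0.5                                # reward noise level

def log_data(n):
    """Collect D^n = {(A_i, R_i)}_{i=1}^n by acting with the behavior policy pi_D."""
    A = rng.choice(K, size=n, p=pi_D)
    R = rng.normal(mu[A], sigma)
    return A, R

v_pi = float(pi @ mu)                      # true target value v^pi_Phi (unknown to the estimator)
A, R = log_data(500)
print(f"true value {v_pi:.3f}, logged {len(A)} interactions")
```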

A fundamental contribution (Li et al., 2014) is the derivation of a minimax risk lower bound for the mean-squared error (MSE) of any possible estimator in the multi-armed bandit setting:

$$R^*_n(\pi, \pi_D, R_{\max}, \sigma^2) \geq \frac{1}{4} \cdot \max \left\{ R_{\max}^2 \cdot \max_{B \subset A} \left[ \pi^2(B) \cdot p_{B,n} \right],\ \frac{V_1}{n} \right\}$$

where $V_1 = \sum_a \frac{\pi^2(a)}{\pi_D(a)} \sigma_{\Phi}^2(a)$, $p_{B,n} = (1 - \pi_D(B))^n$, and $R_{\max}$ bounds the magnitude of the mean rewards. This formula precisely quantifies the impact of both unexplored actions (missing mass) and the inherent statistical uncertainty in estimating rewards.

A central phenomenon revealed is a regime shift: when $n$ is small relative to $|A|$, "missing data" (actions possibly never observed) dominates the risk, and for larger $n$ the classical parametric rate $V_1/n$ dominates. Asymptotically, the minimax risk approaches $V_1/n$.
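
To illustrate the regime shift numerically, the sketch below evaluates the two terms of the lower bound on the same kind of toy instance as $n$ grows. Restricting the maximization to singleton sets $B = \{a\}$ and the choice $R_{\max} = 1$ are simplifying assumptions made only for this illustration.

```python
import numpy as np

# Same style of toy instance as above (hypothetical numbers).
K = 10
pi_D = np.full(K, 1.0 / K)
pi = np.full(K, 0.1 / (K - 1)); pi[0] = 0.9
sigma, R_max = 0.5, 1.0                          # assumed noise level and reward bound

V1 = float(np.sum(pi**2 / pi_D) * sigma**2)      # V_1 = sum_a pi^2(a)/pi_D(a) * sigma^2(a)

for n in (10, 100, 1_000, 10_000):
    p_missing = (1.0 - pi_D) ** n                # p_{B,n} for singleton sets B = {a}
    missing_term = R_max**2 * np.max(pi**2 * p_missing)
    bound = 0.25 * max(missing_term, V1 / n)
    print(f"n={n:6d}  missing-mass={missing_term:.2e}  V1/n={V1 / n:.2e}  bound={bound:.2e}")
```

For small $n$ the missing-mass term dominates; once every action is observed with high probability, the $V_1/n$ term takes over.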

2. Estimator Analysis: Importance Sampling and Regression

Two standard OPE estimators are rigorously analyzed (Li et al., 2014); a minimal code sketch of both appears after the list below:

  • Likelihood-Ratio (LR) / Importance Sampling:

$$v_{\mathrm{LR}} = \frac{1}{n}\sum_{i=1}^n \frac{\pi(A_i)}{\pi_D(A_i)}\, R_i$$

It is unbiased, but its MSE includes an extra term $V_2/n$ (from reward mean variability), so

$$\mathrm{MSE}(v_{\mathrm{LR}}) = \frac{V_1 + V_2}{n}$$

When $V_2$ is large (e.g., the target puts mass on actions rarely sampled under $\pi_D$), LR can perform arbitrarily worse than the minimax lower bound, despite its popularity.

  • Regression Estimator (REG):

$$r(a) = \begin{cases} \frac{R(a)}{n(a)} & \text{if } n(a) > 0 \\ 0 & \text{otherwise} \end{cases} \ ; \qquad v_{\mathrm{REG}} = \sum_a \pi(a)\, r(a)$$

where $n(a)$ is the number of times action $a$ appears in $D^n$ and $R(a)$ is the total reward observed for action $a$.

The REG estimator is shown to be near–minimax optimal (constant-factor) in MSE:

$$\mathrm{MSE}(v_{\mathrm{REG}}) \leq C \cdot R^*_n(\pi, \pi_D, R_{\max}, \sigma^2)$$

and attains the minimax lower bound asymptotically. Thus, REG is robust to misspecification of the behavior policy and less sensitive to extreme importance weights.
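
As referenced above, here is a minimal sketch of both estimators on a small hypothetical bandit instance. All parameters are illustrative, and both $\pi$ and $\pi_D$ are assumed known, as in the setup of Section 1.

```python
import numpy as np

# Hypothetical toy instance: uniform behavior policy, target policy concentrated
# on one arm, so importance weights can be large. Values are illustrative only.
rng = np.random.default_rng(1)
K = 10
pi_D = np.full(K, 1.0 / K)                      # behavior policy pi_D(a)
pi = np.full(K, 0.1 / (K - 1)); pi[0] = 0.9     # target policy pi(a)
mu = rng.uniform(0.0, 1.0, size=K)              # unknown mean rewards
sigma = 0.5

def lr_estimate(A, R):
    """Likelihood-ratio estimator: v_LR = (1/n) sum_i pi(A_i)/pi_D(A_i) * R_i."""
    return float(np.mean(pi[A] / pi_D[A] * R))

def reg_estimate(A, R):
    """Regression estimator: v_REG = sum_a pi(a) r(a), with r(a) = R(a)/n(a) or 0."""
    n_a = np.bincount(A, minlength=K)                 # n(a): visit counts
    R_a = np.bincount(A, weights=R, minlength=K)      # R(a): total reward per arm
    r_hat = np.divide(R_a, n_a, out=np.zeros(K), where=n_a > 0)
    return float(pi @ r_hat)

n = 500
A = rng.choice(K, size=n, p=pi_D)                     # logged actions from pi_D
R = rng.normal(mu[A], sigma)                          # logged rewards
print(f"true value {pi @ mu:.3f}  LR {lr_estimate(A, R):.3f}  REG {reg_estimate(A, R):.3f}")
```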

A corollary is that more accurate knowledge of the "true" behavior policy (for example, obtained via semi-supervised learning) does not improve REG's finite-sample performance, contrary to the intuition that better propensity estimation from unlabeled data should help.

3. Extensions: Contextual Bandits and Markov Decision Processes

The theoretical framework extends well beyond the basic bandit case (Li et al., 2014):

  • Contextual Bandits: Each context–action pair $(x, a)$ is treated as an "augmented" action; the same lower bound and estimator analyses apply when the context distribution $\mu$ and the target policy $\pi(a \mid x)$ are known. LR and REG are adapted using the importance ratio $\pi(a \mid x)/\pi_D(a \mid x)$, and finite-sample lower bounds again follow, using the product space $X \times A$.
  • Finite-Horizon MDPs: In fixed-horizon MDPs, an entire trajectory (state–action sequence) is regarded as a single action. The augmented space scales exponentially, as $|\mathcal{S}|^{H+1} \cdot |\mathcal{A}|^H$. The minimax risk and estimator behavior generalize in a formally identical way, but the lower bound grows exponentially in the planning horizon $H$, demonstrating that OPE is intrinsically "hard" for long-horizon problems unless data is extremely abundant.

A key insight is that these general cases can often be reduced to the analysis of the bandit setting over an appropriate augmented action space.
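
The reduction can be made concrete with a small sketch: contexts and actions are combined into augmented actions indexed by $(x, a)$, LR uses the per-context ratio $\pi(a \mid x)/\pi_D(a \mid x)$, and REG reweights per-pair reward averages by the known $\mu(x)\,\pi(a \mid x)$. The instance below is hypothetical and purely illustrative.

```python
import numpy as np

# Contextual-bandit OPE by "lifting" to the augmented action space X x A.
# Small hypothetical instance with known context distribution and policies.
rng = np.random.default_rng(2)

n_x, n_a = 5, 4
mu_x = np.full(n_x, 1.0 / n_x)                       # context distribution mu(x)
pi_D = np.full((n_x, n_a), 1.0 / n_a)                # behavior policy pi_D(a|x)
pi = rng.dirichlet(np.ones(n_a), size=n_x)           # target policy pi(a|x)
mean_r = rng.uniform(0.0, 1.0, size=(n_x, n_a))      # unknown mean reward of (x, a)

n = 2000
X = rng.choice(n_x, size=n, p=mu_x)
A = np.array([rng.choice(n_a, p=pi_D[x]) for x in X])
R = rng.normal(mean_r[X, A], 0.3)

# LR with the per-context importance ratio pi(a|x) / pi_D(a|x)
v_lr = float(np.mean(pi[X, A] / pi_D[X, A] * R))

# REG over augmented actions (x, a), reweighted by mu(x) * pi(a|x)
idx = X * n_a + A                                    # index of the augmented action
counts = np.bincount(idx, minlength=n_x * n_a).reshape(n_x, n_a)
totals = np.bincount(idx, weights=R, minlength=n_x * n_a).reshape(n_x, n_a)
r_hat = np.divide(totals, counts, out=np.zeros((n_x, n_a)), where=counts > 0)
v_reg = float(np.sum(mu_x[:, None] * pi * r_hat))

v_true = float(np.sum(mu_x[:, None] * pi * mean_r))
print(f"true {v_true:.3f}  LR {v_lr:.3f}  REG {v_reg:.3f}")
```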

4. Implications for Semi-Supervised Learning

The minimax analysis (Li et al., 2014) provides a counterpoint to the application of semi-supervised learning in OPE. While unlabeled "action-only" data can improve estimation of the behavior policy $\pi_D$, the regression estimator's finite-sample performance depends only on the empirical frequencies, not on the explicit use of $\pi_D$. It is shown that, even if the behavior policy is estimated perfectly from large unlabeled data, the benefit does not extend to outperforming the regression estimator. Thus, for OPE in bandit, contextual-bandit, and MDP settings, the potential gains from supplementing with large amounts of unlabeled data to better estimate propensities are inherently limited.

5. Key Mathematical Results

Critical results established in (Li et al., 2014) include:

| Estimator | MSE Formula | Minimax-Optimality |
| --- | --- | --- |
| LR (Importance Ratio) | $\frac{V_1 + V_2}{n}$ | No: can be arbitrarily suboptimal |
| REG (Regression) | $\leq C \cdot R^*_n$; approaches the $V_1/n$ rate | Yes: near-optimal up to the constant $C$ |
| Minimax Risk Lower Bound | $\frac{1}{4} \max\left\{ R_{\max}^2 \max_{B} \left[ \pi^2(B)\, p_{B,n} \right],\ V_1/n \right\}$ | Fundamental limit for all estimators |

The full set of formulas (see Li et al., 2014) precisely quantifies the risk in every relevant regime.

6. Sample Size and "Missing Mass" Phenomena

A defining feature of the minimax lower bound is the explicit effect of data sparsity, the "missing mass." For small $n$, most actions may not be observed, so for any estimator the worst-case MSE is driven by this phenomenon. As $n$ increases and samples cover all actions with high probability, estimation risk transitions to being controlled by the variance term $V_1/n$. This transition is a universal property across bandits, contextual bandits, and MDPs.
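
A small Monte Carlo experiment (hypothetical instance, illustrative numbers only) makes the transition visible: for very small $n$ the empirical MSE of REG is dominated by arms the target policy favors but the log never touched, and it then shrinks at roughly a $1/n$ rate once coverage improves.

```python
import numpy as np

# Empirical MSE of the REG estimator as n grows, on a hypothetical toy instance.
rng = np.random.default_rng(3)

K = 10
pi_D = np.full(K, 1.0 / K)
pi = np.full(K, 0.1 / (K - 1)); pi[0] = 0.9
mu = rng.uniform(0.0, 1.0, size=K)
sigma = 0.5
v_true = float(pi @ mu)

def reg(A, R):
    counts = np.bincount(A, minlength=K)
    totals = np.bincount(A, weights=R, minlength=K)
    r_hat = np.divide(totals, counts, out=np.zeros(K), where=counts > 0)
    return float(pi @ r_hat)

for n in (5, 20, 100, 1_000):
    errs = []
    for _ in range(2_000):                       # Monte Carlo repetitions
        A = rng.choice(K, size=n, p=pi_D)
        R = rng.normal(mu[A], sigma)
        errs.append((reg(A, R) - v_true) ** 2)
    print(f"n={n:5d}  empirical MSE of REG = {np.mean(errs):.4f}")
```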

There is an inherent statistical limit: when $n = O(\sqrt{K})$ (where $K$ is the number of arms/actions), no estimator can beat a constant-order risk, even with perfect side information.

7. Broader Impact and Theoretical Guidance

The minimax framework and estimator analyses of (Li et al., 2014) impose strict performance limits on offline policy evaluation—the risk cannot be avoided by algorithmic or statistical means in finite samples if the data coverage is insufficient. Standard importance sampling–based estimators, although unbiased, can be highly suboptimal due to extra variance terms. Regression-style estimators, using observed averages reweighted by the target policy, circumvent this extra risk and are provably (near-)optimal.

The analytical framework generalizes to contextual bandits and fixed-horizon MDPs (by "lifting" contexts or trajectories to the action space), clarifies why OPE is especially challenging for long-horizon or high-dimensional problems, and establishes that the benefits of integrating large unlabeled data for better behavior policy estimation are fundamentally limited under practical sample constraints.

For practitioners and theorists, these results guide both the selection of estimators and expectations for achievable error: finite-sample coverage, horizon length, and statistical variability jointly define the boundary of reliable offline evaluation in reinforcement learning and related decision-making settings.
