Offline Policy Evaluation

Updated 5 August 2025
  • Offline policy evaluation is a method to estimate a target policy’s long-run performance using historical data from different behavior policies without online testing.
  • It rigorously compares estimators like importance sampling and regression, quantifying risks such as variance due to data sparsity and model misspecification.
  • The framework extends to contextual bandits and MDPs, highlighting finite-sample limitations and the constrained benefits of semi-supervised improvements.

Offline policy evaluation (OPE) is the statistical task of estimating the expected performance (e.g., long-run average reward, expected cumulative reward) of a target policy using data collected from one or more different policies, without further online interaction. OPE is essential in domains such as healthcare, robotics, operations research, recommendation systems, and digital marketing, where deploying untested policies can incur significant risk, cost, or ethical constraints. OPE presents deep challenges, including limited coverage of the state–action space, distributional shift, finite-sample effects, model misspecification, and the possibility of hidden confounding.

1. Problem Formulation and Minimax Lower Bounds

Offline policy evaluation is typically formalized as follows. Data $D^n = \{(A_i, R_i)\}_{i=1}^n$ is collected under a known behavior policy $\pi_D$, with $A_i \sim \pi_D$ and $R_i$ drawn from an unknown reward distribution $\Phi(\cdot \mid A_i)$. The target policy $\pi$ is known, and the goal is to estimate its value $v^\pi_\Phi = \mathbb{E}_{A \sim \pi,\, R \sim \Phi(\cdot \mid A)}[R]$ as accurately as possible.
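
As a concrete illustration of this setup, the following sketch builds a small hypothetical bandit instance and logs data under a known behavior policy. The action count, policies, reward means, and noise level are all illustrative choices, not values from (Li et al., 2014).

```python
import numpy as np

# Hypothetical toy instance: a K-armed bandit logged under a uniform behavior
# policy, with a target policy concentrated on one arm. All values illustrative.
rng = np.random.default_rng(0)

K = 10
pi_D = np.full(K, 1.0 / K)                 # behavior policy pi_D(a)
pi = np.full(K, 0.1 / (K - 1))             # target policy pi(a) ...
pi[0] = 0.9                                # ... which mostly plays arm 0
mu = rng.uniform(0.0, 1.0, size=K)         # unknown mean rewards mu(a)
sigma = 0.5                                # reward noise level

def log_data(n):
    """Collect D^n = {(A_i, R_i)}_{i=1}^n by acting with the behavior policy pi_D."""
    A = rng.choice(K, size=n, p=pi_D)
    R = rng.normal(mu[A], sigma)
    return A, R

v_pi = float(pi @ mu)                      # true target value v^pi_Phi (unknown to the estimator)
A, R = log_data(500)
print(f"true value {v_pi:.3f}, logged {len(A)} interactions")
```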

A fundamental contribution (Li et al., 2014) is the derivation of a minimax risk lower bound for the mean-squared error (MSE) of any possible estimator in the multi-armed bandit setting:

$$R^*_n(\pi, \pi_D, R_{\max}, \sigma^2) \geq \frac{1}{4} \cdot \max \left\{ R_{\max}^2 \cdot \max_{B \subset A} \left[ \pi^2(B) \cdot p_{B,n} \right],\ \frac{V_1}{n} \right\}$$

where $V_1 = \sum_a \frac{\pi^2(a)}{\pi_D(a)} \sigma_{\Phi}^2(a)$, $p_{B,n} = (1 - \pi_D(B))^n$, and $R_{\max}$ bounds the magnitude of the mean rewards. This formula precisely quantifies the impact of both unexplored actions (missing mass) and the inherent statistical uncertainty in estimating rewards.

A central phenomenon revealed is a regime shift: when $n$ is small relative to $|A|$, "missing data" (actions possibly never observed) dominates the risk, and for larger $n$ the classical parametric rate $V_1/n$ dominates. Asymptotically, the minimax risk approaches $V_1/n$.
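
To illustrate the regime shift numerically, the sketch below evaluates the two terms of the lower bound on the same kind of toy instance as $n$ grows. Restricting the maximization to singleton sets $B = \{a\}$ and the choice $R_{\max} = 1$ are simplifying assumptions made only for this illustration.

```python
import numpy as np

# Same style of toy instance as above (hypothetical numbers).
K = 10
pi_D = np.full(K, 1.0 / K)
pi = np.full(K, 0.1 / (K - 1)); pi[0] = 0.9
sigma, R_max = 0.5, 1.0                          # assumed noise level and reward bound

V1 = float(np.sum(pi**2 / pi_D) * sigma**2)      # V_1 = sum_a pi^2(a)/pi_D(a) * sigma^2(a)

for n in (10, 100, 1_000, 10_000):
    p_missing = (1.0 - pi_D) ** n                # p_{B,n} for singleton sets B = {a}
    missing_term = R_max**2 * np.max(pi**2 * p_missing)
    bound = 0.25 * max(missing_term, V1 / n)
    print(f"n={n:6d}  missing-mass={missing_term:.2e}  V1/n={V1 / n:.2e}  bound={bound:.2e}")
```

For small $n$ the missing-mass term dominates; once every action is observed with high probability, the $V_1/n$ term takes over.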

2. Estimator Analysis: Importance Sampling and Regression

Two standard OPE estimators are rigorously analyzed (Li et al., 2014); a minimal code sketch of both appears after the list below:

  • Likelihood-Ratio (LR) / Importance Sampling:

$$v_{\mathrm{LR}} = \frac{1}{n}\sum_{i=1}^n \frac{\pi(A_i)}{\pi_D(A_i)}\, R_i$$

It is unbiased, but its MSE includes an extra term $V_2/n$ (from reward mean variability), so

$$\mathrm{MSE}(v_{\mathrm{LR}}) = \frac{V_1 + V_2}{n}$$

When $V_2$ is large (e.g., the target puts mass on actions rarely sampled under $\pi_D$), LR can perform arbitrarily worse than the minimax lower bound, despite its popularity.

  • Regression Estimator (REG):

$$r(a) = \begin{cases} \frac{R(a)}{n(a)} & \text{if } n(a) > 0 \\ 0 & \text{otherwise} \end{cases} \ ; \qquad v_{\mathrm{REG}} = \sum_a \pi(a)\, r(a)$$

where $n(a)$ is the number of times action $a$ appears in $D^n$ and $R(a)$ is the total reward observed for action $a$.

The REG estimator is shown to be near–minimax optimal (constant-factor) in MSE:

$$\mathrm{MSE}(v_{\mathrm{REG}}) \leq C \cdot R^*_n(\pi, \pi_D, R_{\max}, \sigma^2)$$

and attains the minimax lower bound asymptotically. Thus, REG is robust to misspecification of the behavior policy and less sensitive to extreme importance weights.
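
As referenced above, here is a minimal sketch of both estimators on a small hypothetical bandit instance. All parameters are illustrative, and both $\pi$ and $\pi_D$ are assumed known, as in the setup of Section 1.

```python
import numpy as np

# Hypothetical toy instance: uniform behavior policy, target policy concentrated
# on one arm, so importance weights can be large. Values are illustrative only.
rng = np.random.default_rng(1)
K = 10
pi_D = np.full(K, 1.0 / K)                      # behavior policy pi_D(a)
pi = np.full(K, 0.1 / (K - 1)); pi[0] = 0.9     # target policy pi(a)
mu = rng.uniform(0.0, 1.0, size=K)              # unknown mean rewards
sigma = 0.5

def lr_estimate(A, R):
    """Likelihood-ratio estimator: v_LR = (1/n) sum_i pi(A_i)/pi_D(A_i) * R_i."""
    return float(np.mean(pi[A] / pi_D[A] * R))

def reg_estimate(A, R):
    """Regression estimator: v_REG = sum_a pi(a) r(a), with r(a) = R(a)/n(a) or 0."""
    n_a = np.bincount(A, minlength=K)                 # n(a): visit counts
    R_a = np.bincount(A, weights=R, minlength=K)      # R(a): total reward per arm
    r_hat = np.divide(R_a, n_a, out=np.zeros(K), where=n_a > 0)
    return float(pi @ r_hat)

n = 500
A = rng.choice(K, size=n, p=pi_D)                     # logged actions from pi_D
R = rng.normal(mu[A], sigma)                          # logged rewards
print(f"true value {pi @ mu:.3f}  LR {lr_estimate(A, R):.3f}  REG {reg_estimate(A, R):.3f}")
```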

A corollary is that more accurate knowledge of the "true" behavior policy (for example, obtained via semi-supervised learning) does not improve REG's finite-sample performance, contrary to the intuition that better propensity estimation from unlabeled data should help.

3. Extensions: Contextual Bandits and Markov Decision Processes

The theoretical framework extends well beyond the basic bandit case (Li et al., 2014):

  • Contextual Bandits: Each context–action pair $(x, a)$ is treated as an "augmented" action; the same lower bound and estimator analyses apply when the context distribution $\mu$ and the target policy $\pi(a \mid x)$ are known. LR and REG are adapted using the importance ratio $\pi(a \mid x)/\pi_D(a \mid x)$, and finite-sample lower bounds again follow, using the product space $X \times A$.
  • Finite-Horizon MDPs: In fixed-horizon MDPs, an entire trajectory (state–action sequence) is regarded as a single action. The augmented space scales exponentially, as $|\mathcal{S}|^{H+1} \cdot |\mathcal{A}|^H$. The minimax risk and estimator behavior generalize in a formally identical way, but the lower bound grows exponentially in the planning horizon $H$, demonstrating that OPE is intrinsically "hard" for long-horizon problems unless data is extremely abundant.

A key insight is that these general cases can often be reduced to the analysis of the bandit setting over an appropriate augmented action space.
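
The reduction can be made concrete with a small sketch: contexts and actions are combined into augmented actions indexed by $(x, a)$, LR uses the per-context ratio $\pi(a \mid x)/\pi_D(a \mid x)$, and REG reweights per-pair reward averages by the known $\mu(x)\,\pi(a \mid x)$. The instance below is hypothetical and purely illustrative.

```python
import numpy as np

# Contextual-bandit OPE by "lifting" to the augmented action space X x A.
# Small hypothetical instance with known context distribution and policies.
rng = np.random.default_rng(2)

n_x, n_a = 5, 4
mu_x = np.full(n_x, 1.0 / n_x)                       # context distribution mu(x)
pi_D = np.full((n_x, n_a), 1.0 / n_a)                # behavior policy pi_D(a|x)
pi = rng.dirichlet(np.ones(n_a), size=n_x)           # target policy pi(a|x)
mean_r = rng.uniform(0.0, 1.0, size=(n_x, n_a))      # unknown mean reward of (x, a)

n = 2000
X = rng.choice(n_x, size=n, p=mu_x)
A = np.array([rng.choice(n_a, p=pi_D[x]) for x in X])
R = rng.normal(mean_r[X, A], 0.3)

# LR with the per-context importance ratio pi(a|x) / pi_D(a|x)
v_lr = float(np.mean(pi[X, A] / pi_D[X, A] * R))

# REG over augmented actions (x, a), reweighted by mu(x) * pi(a|x)
idx = X * n_a + A                                    # index of the augmented action
counts = np.bincount(idx, minlength=n_x * n_a).reshape(n_x, n_a)
totals = np.bincount(idx, weights=R, minlength=n_x * n_a).reshape(n_x, n_a)
r_hat = np.divide(totals, counts, out=np.zeros((n_x, n_a)), where=counts > 0)
v_reg = float(np.sum(mu_x[:, None] * pi * r_hat))

v_true = float(np.sum(mu_x[:, None] * pi * mean_r))
print(f"true {v_true:.3f}  LR {v_lr:.3f}  REG {v_reg:.3f}")
```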

4. Implications for Semi-Supervised Learning

The minimax analysis (Li et al., 2014) provides a counterpoint to the application of semi-supervised learning in OPE. While unlabeled "action-only" data can improve estimation of the behavior policy $\pi_D$, the regression estimator's finite-sample performance depends only on the empirical frequencies, not on the explicit use of $\pi_D$. It is shown that, even if the behavior policy is estimated perfectly from large unlabeled data, the benefit does not extend to outperforming the regression estimator. Thus, for OPE in bandit, contextual-bandit, and MDP settings, the potential gains from supplementing with large amounts of unlabeled data to better estimate propensities are inherently limited.

5. Key Mathematical Results

Critical results established in (Li et al., 2014) include:

| Estimator | MSE Formula | Minimax-Optimality |
| --- | --- | --- |
| LR (Importance Ratio) | $\frac{V_1 + V_2}{n}$ | No: can be arbitrarily suboptimal |
| REG (Regression) | $\leq C \cdot R^*_n$; approaches the $V_1/n$ rate | Yes: near-optimal up to the constant $C$ |
| Minimax Risk Lower Bound | $\frac{1}{4} \max\left\{ R_{\max}^2 \max_{B} \left[ \pi^2(B)\, p_{B,n} \right],\ V_1/n \right\}$ | Fundamental limit for all estimators |

The full set of formulas (see Li et al., 2014) precisely quantifies the risk in every relevant regime.

6. Sample Size and "Missing Mass" Phenomena

A defining feature of the minimax lower bound is the explicit effect of data sparsity, the "missing mass." For small $n$, most actions may not be observed, so for any estimator the worst-case MSE is driven by this phenomenon. As $n$ increases and samples cover all actions with high probability, estimation risk transitions to being controlled by the variance term $V_1/n$. This transition is a universal property across bandits, contextual bandits, and MDPs.
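
A small Monte Carlo experiment (hypothetical instance, illustrative numbers only) makes the transition visible: for very small $n$ the empirical MSE of REG is dominated by arms the target policy favors but the log never touched, and it then shrinks at roughly a $1/n$ rate once coverage improves.

```python
import numpy as np

# Empirical MSE of the REG estimator as n grows, on a hypothetical toy instance.
rng = np.random.default_rng(3)

K = 10
pi_D = np.full(K, 1.0 / K)
pi = np.full(K, 0.1 / (K - 1)); pi[0] = 0.9
mu = rng.uniform(0.0, 1.0, size=K)
sigma = 0.5
v_true = float(pi @ mu)

def reg(A, R):
    counts = np.bincount(A, minlength=K)
    totals = np.bincount(A, weights=R, minlength=K)
    r_hat = np.divide(totals, counts, out=np.zeros(K), where=counts > 0)
    return float(pi @ r_hat)

for n in (5, 20, 100, 1_000):
    errs = []
    for _ in range(2_000):                       # Monte Carlo repetitions
        A = rng.choice(K, size=n, p=pi_D)
        R = rng.normal(mu[A], sigma)
        errs.append((reg(A, R) - v_true) ** 2)
    print(f"n={n:5d}  empirical MSE of REG = {np.mean(errs):.4f}")
```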

There is an inherent statistical limit: when $n = O(\sqrt{K})$ (where $K$ is the number of arms/actions), no estimator can beat a constant-order risk, even with perfect side information.

7. Broader Impact and Theoretical Guidance

The minimax framework and estimator analyses of (Li et al., 2014) impose strict performance limits on offline policy evaluation—the risk cannot be avoided by algorithmic or statistical means in finite samples if the data coverage is insufficient. Standard importance sampling–based estimators, although unbiased, can be highly suboptimal due to extra variance terms. Regression-style estimators, using observed averages reweighted by the target policy, circumvent this extra risk and are provably (near-)optimal.

The analytical framework generalizes to contextual bandits and fixed-horizon MDPs (by "lifting" contexts or trajectories to the action space), clarifies why OPE is especially challenging for long-horizon or high-dimensional problems, and establishes that the benefits of integrating large unlabeled data for better behavior policy estimation are fundamentally limited under practical sample constraints.

For practitioners and theorists, these results guide both the selection of estimators and expectations for achievable error: finite-sample coverage, horizon length, and statistical variability jointly define the boundary of reliable offline evaluation in reinforcement learning and related decision-making settings.
