Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling (1906.03393v4)

Published 8 Jun 2019 in cs.LG, cs.AI, and stat.ML

Abstract: Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $$ \frac{1}{n} \sum\nolimits_{t=1}^{H}\mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \mathrm{Var}_{\mu}\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \,\middle|\, s_t\right]\right] + \tilde{O}(n^{-1.5}) $$ where $\mu$ and $\pi$ are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal distributions of the state at the $t$-th step, $H$ is the horizon, $n$ is the sample size, and $V_{t+1}^\pi$ is the value function of the MDP under $\pi$. The result matches the Cramer-Rao lower bound in \citet{jiang2016doubly} up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.

Citations (169)

Summary

  • The paper introduces Marginalized Importance Sampling (MIS), a novel technique for off-policy evaluation in reinforcement learning that recursively estimates state distributions to significantly reduce variance.
  • MIS achieves a mean-squared error bound polynomial in the horizon H, demonstrating near-optimal sample complexity compared to the exponential variance of traditional importance sampling methods.
  • Empirical validation shows MIS consistently outperforms existing importance sampling estimators across various environments, offering enhanced evaluation accuracy crucial for deploying RL in practice.

Overview of Marginalized Importance Sampling for Optimal Off-Policy Evaluation in Reinforcement Learning

The paper "Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling" introduces a novel approach to addressing high variance in off-policy evaluation (OPE) for reinforcement learning (RL). The authors propose the Marginalized Importance Sampling (MIS) estimator, which leverages the recursive estimation of state marginal distributions at each step, significantly reducing variance dependence on the RL horizon HH. This method is contrasted with traditional importance sampling (IS) methods, which suffer from exponential variance growth with respect to the horizon, making them less effective for long-horizon problems.

Key Contributions

  1. Marginalized Importance Sampling (MIS):
    • MIS replaces the cumulative importance weights used in standard IS methods with a recursive estimation of the state marginal distributions under the target policy π (see the tabular sketch after this list).
    • The estimator computes importance weights from the marginal state distributions at each step, thereby reducing variance while maintaining consistency.
  2. Sample Complexity and Error Bounds:
    • The paper establishes a mean-squared error (MSE) bound for MIS that is polynomial in H, a substantial improvement over the exponential bounds of IS methods.
    • The MSE matches the Cramer-Rao lower bound up to a multiplicative factor of H, suggesting near-optimality.
    • The results are robust across different environments, including time-varying, partially observable, and long-horizon RL settings.
  3. Empirical Validation:
    • The empirical superiority of MIS is demonstrated across several domains, including ModelWin and ModelFail MDPs, time-varying non-mixing MDPs, and the Mountain Car control task.
    • MIS consistently outperforms traditional IS-based estimators, showcasing lower variance and enhanced evaluation accuracy.
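
The following is a minimal tabular sketch of an MIS-style estimator, assuming a finite state space, a known logging policy, and logged trajectories of (state, action, reward, next state) tuples; it illustrates the recursive structure described above rather than reproducing the authors' reference implementation (which, among other details, must handle states never visited in the logged data).

```python
import numpy as np

def mis_estimate(trajectories, pi, mu, n_states, H):
    """Marginalized importance sampling (MIS) value estimate for a tabular,
    finite-horizon MDP (illustrative sketch).

    trajectories: list of n trajectories, each a list of H tuples
                  (s, a, r, s_next) with integer states/actions
    pi, mu:       arrays of shape (H, n_states, n_actions) holding the
                  target and logging policies' action probabilities
    Returns an estimate of the target policy's expected H-step return.
    """
    n = len(trajectories)

    # d_pi[t] is the recursively estimated marginal state distribution of
    # the target policy at step t; step 0 is shared with the logging policy,
    # so we use the empirical initial-state distribution.
    d_pi = np.zeros((H, n_states))
    for traj in trajectories:
        d_pi[0, traj[0][0]] += 1.0 / n

    value = 0.0
    for t in range(H):
        # Importance-weighted, per-state estimates of the step-t transition
        # kernel and mean reward under the target policy.
        next_dist = np.zeros((n_states, n_states))  # ~ P_t^pi(s' | s)
        mean_reward = np.zeros(n_states)            # ~ r_t^pi(s)
        counts = np.zeros(n_states)

        for traj in trajectories:
            s, a, r, s_next = traj[t]
            w = pi[t, s, a] / mu[t, s, a]  # single-step importance ratio
            next_dist[s, s_next] += w
            mean_reward[s] += w * r
            counts[s] += 1.0

        visited = counts > 0
        next_dist[visited] /= counts[visited][:, None]
        mean_reward[visited] /= counts[visited]

        # Accumulate the step-t reward under the estimated marginal d_t^pi.
        value += d_pi[t] @ mean_reward

        # Recursive update: d_{t+1}^pi(s') = sum_s d_t^pi(s) P_t^pi(s' | s).
        if t + 1 < H:
            d_pi[t + 1] = d_pi[t] @ next_dist

    return value
```

Because this estimator only ever forms single-step ratios pi/mu and propagates an estimated state marginal forward, its variance does not involve the product of H importance weights that drives the exponential blow-up of standard IS.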

Implications and Future Directions

The development of MIS represents a significant advancement in the toolkit available for OPE in RL, providing a practical solution to the high-variance problem that has limited the effectiveness of IS-based methods in long-horizon settings. The theoretical insights and bounds offered in the paper pave the way for more efficient RL algorithms, potentially enabling safer and more reliable deployment of RL in real-world applications, such as in medical treatments or autonomous systems where online policy evaluation is costly or risky.

Future research could explore the extension of MIS to broader RL scenarios, such as continuous state and action spaces, or its application in model-free RL approaches. Additionally, the integration of MIS with model-based techniques could further enhance OPE by combining low-variance estimation with predictive capabilities.

MIS is a promising step toward more accurate, lower-variance OPE, addressing a central obstacle to deploying RL algorithms in real-world environments.