
Cross-Policy Sampling

Updated 12 October 2025
  • Cross-policy sampling is a methodology that strategically allocates simulation and observational data across multiple policies for improved evaluation and decision-making.
  • It employs adaptive budget allocation, importance sampling, and bandit-based sampling to reduce estimation variance and enhance computational efficiency.
  • This approach is pivotal in off-policy evaluation, meta-policy selection, and simulation-based design across diverse applications like robotics, multi-agent systems, and healthcare.

Cross-policy sampling refers to a family of methodologies and algorithmic frameworks in which samples (i.e., simulated rollouts, data trajectories, or observations) are adaptively selected, allocated, or re-used across multiple policies—either for evaluation, optimization, or ensemble selection—rather than subsampling uniformly or independently for each policy under consideration. The paradigm is central to modern reinforcement learning (RL), stochastic optimization, simulation-based model selection, and sequential decision problems, and manifests in the form of importance sampling, adaptive sampling via bandit formulations, ensemble policy selection, and policy distillation, among other techniques. The defining characteristic is the strategic sharing, prioritization, or joint allocation of simulation or real-world interaction budget across states, actions, or policy candidates, with the intent of improving efficiency, variance reduction, or adaptability in highly stochastic or resource-constrained environments.

1. Core Principles of Cross-Policy Sampling

Cross-policy sampling encompasses several methodological principles drawn from RL, bandit theory, and simulation:

  • Adaptive Budget Allocation: Rather than preassigning a fixed number of simulations to each state-action-policy triplet, samples are adaptively allocated to states, actions, or policies most likely to improve the decision. In "Rollout Sampling Approximate Policy Iteration" (0805.2027), this is achieved by recasting the sampling allocation among states as a multi-armed bandit problem, where each state is an arm and rollouts are allocated based on a utility criterion designed to maximize information gain or sample efficiency.
  • Interleaved/Data-Efficient Sampling: Policies are evaluated via simulation not only independently but also by sharing, prioritizing, or terminating sampling early for resolved states, as in the RSPI variants leveraging upper confidence bounds and early-stopping via Hoeffding's inequality.
  • Importance Sampling across Policies: In off-policy evaluation (OPE) and optimization, importance sampling (IS) is employed to reweight samples drawn from a behavior policy so that they are informative for a target policy. Extensions to this idea actively optimize the behavior policy to reduce the variance of the estimator, thereby achieving improved sample efficiency (Papini et al., 9 May 2024).
  • Ensemble and Meta-Policy Construction: When multiple candidate policies exist (e.g., in contextual stochastic optimization), the selection of which policy to deploy in a given context is learned via meta-policies that exploit the heterogeneity of candidate performance across covariate regimes, as in the Prescribe-then-Select framework (Iglesias et al., 9 Sep 2025).

2. Multi-Armed Bandit Formulations and Adaptive Sampling

A foundational contribution in cross-policy sampling is the recasting of rollout allocation as a bandit resource allocation problem (0805.2027). The sampling problem is formalized as follows:

  • Each state $s \in S_{R}$ is considered a bandit arm.
  • Pulling an arm corresponds to running a simulation (rollout) from state $s$ for all actions.
  • Utilities such as $U_{\textrm{ucba}}(s) = \hat{\Delta}^\pi(s) + \sqrt{1/(1 + c(s))}$ or $U_{\textrm{ucbb}}(s) = \hat{\Delta}^\pi(s) + \sqrt{\ln m/(1 + c(s))}$ balance the empirical value difference $\hat{\Delta}^\pi(s)$ (exploitative sampling) against sampling uncertainty (explorative sampling), where $c(s)$ counts the samples drawn from $s$ and $m$ is the cumulative number of rollouts used.
  • A state is considered "resolved" when a stopping rule based on Hoeffding's inequality is satisfied:

$$\hat{\Delta}^\pi(s) \geq \sqrt{\frac{(b_2 - b_1)^2}{2c(s)}\,\ln \frac{|\mathcal{A}| - 1}{\delta}}$$

thus, with at least $1 - \delta$ confidence, the sampled action difference is statistically significant.

Bandit-based sampling ensures sample allocation is focused on states where action superiority is ambiguous, accelerating convergence and economizing computational resources.
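
The sketch below illustrates this allocation loop under simplifying assumptions: a finite state and action set, a hypothetical `rollout_value(s, a)` simulation hook, and known bounds $b_1, b_2$ on rollout returns as in the Hoeffding rule above. It follows the $U_{\textrm{ucba}}$ utility and stopping criterion in spirit, as an illustration rather than a faithful reimplementation of RSPI.

```python
import math

def ucba_rollout_allocation(states, actions, rollout_value, b1, b2,
                            delta=0.05, budget=10_000):
    """Bandit-style allocation of rollouts across states (illustrative sketch).

    rollout_value(s, a) -> float : one noisy rollout return (hypothetical hook)
    b1, b2                       : lower/upper bounds on rollout returns
    Returns {state: greedy action} for every state resolved within the budget.
    Assumes at least two actions.
    """
    counts = {s: 0 for s in states}                          # c(s): pulls per state
    means = {s: {a: 0.0 for a in actions} for s in states}   # running action values
    unresolved, choice, used = set(states), {}, 0

    def gap(s):
        """Empirical value difference: best action mean minus runner-up mean."""
        vals = sorted(means[s].values(), reverse=True)
        return vals[0] - vals[1]

    while unresolved and used < budget:
        # U_ucba(s) = gap(s) + sqrt(1 / (1 + c(s))): exploit large gaps, and
        # explore states that have received few rollouts so far.
        s = max(unresolved,
                key=lambda st: gap(st) + math.sqrt(1.0 / (1 + counts[st])))

        # One "arm pull": a rollout of every action from state s.
        c = counts[s]
        for a in actions:
            r = rollout_value(s, a)
            means[s][a] += (r - means[s][a]) / (c + 1)
            used += 1
        counts[s] = c + 1

        # Hoeffding stopping rule: declare s resolved once the gap is significant.
        threshold = math.sqrt(((b2 - b1) ** 2) / (2 * counts[s])
                              * math.log((len(actions) - 1) / delta))
        if gap(s) >= threshold:
            choice[s] = max(means[s], key=means[s].get)
            unresolved.discard(s)

    return choice
```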

3. Cross-Policy Sampling in Off-Policy Evaluation and Optimization

Importance sampling underpins cross-policy sample reuse across a spectrum of RL settings. Given samples $x_i \sim q$, one estimates expectations under the target distribution $p$:

$$\hat{\mu}_{p/q} = \frac{1}{N} \sum_{i=1}^N \frac{p(x_i)}{q(x_i)}\, f(x_i)$$
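
This estimator can be transcribed directly, as in the sketch below; the callables `p_pdf`, `q_pdf`, and `f` are placeholders, and the Gaussian example is only a sanity check, not taken from the cited papers.

```python
import numpy as np

def importance_sampling_estimate(samples, p_pdf, q_pdf, f):
    """Estimate E_p[f(X)] from samples x_i ~ q via importance weighting.

    samples : array of draws from the behavior distribution q
    p_pdf   : density (or policy likelihood) of the target p
    q_pdf   : density (or policy likelihood) of the behavior q
    f       : integrand whose expectation under p is wanted
    """
    samples = np.asarray(samples)
    weights = p_pdf(samples) / q_pdf(samples)      # importance ratios p(x)/q(x)
    return np.mean(weights * f(samples))

# Sanity check: estimate E_p[X] for p = N(1, 1) using draws from q = N(0, 1).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=100_000)
    p = lambda z: np.exp(-0.5 * (z - 1.0) ** 2) / np.sqrt(2 * np.pi)
    q = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    print(importance_sampling_estimate(x, p, q, f=lambda z: z))  # approx. 1.0
```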

In policy optimization, trajectory-level IS ratios are used to evaluate candidate policies from rollouts under different policies. Recent advances (Hanna et al., 2018, Zhou et al., 28 May 2025) have established that estimating the behavior policy from data—especially with non-Markovian (history-dependent) conditioning—can systematically reduce the asymptotic variance of IS estimators, though potentially at the cost of increased finite-sample bias. This is formalized in a bias-variance decomposition:

$$\operatorname{MSE}\left(\hat{v}_{\text{OIS}}^{(k)}\right) = \frac{1}{n} \operatorname{Var}\left( \operatorname{Proj}_{\mathbb{T}(k)}\left[\lambda_T G_T\right] \right) + O\left(\frac{(k+1)\, C^{2T} R_{\max}^2}{n^{3/2} \varepsilon^2}\right)$$

where a longer history dependency $k$ in the estimated behavior policy decreases the asymptotic variance through the projection term but increases the bias term.
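
A minimal sketch of the trajectory-level ratio $\lambda_T$ under an estimated, history-conditioned behavior policy is shown below. The callables `logp_target` and `logp_behavior_hat` are hypothetical stand-ins; the latter represents the fitted behavior model whose history length $k$ drives the bias-variance trade-off above.

```python
import numpy as np

def trajectory_is_weights(trajectories, logp_target, logp_behavior_hat):
    """Per-trajectory importance ratios lambda_T with an estimated behavior policy.

    trajectories               : list of trajectories, each a list of (state, action)
    logp_target(s, a)          : log pi_e(a | s) under the evaluation policy
    logp_behavior_hat(h, s, a) : log of the *estimated* behavior probability,
                                 conditioned on the (possibly truncated) history h;
                                 fitting this model from data is what trades
                                 asymptotic variance for finite-sample bias.
    """
    weights = []
    for traj in trajectories:
        log_w, history = 0.0, []
        for (s, a) in traj:
            log_w += logp_target(s, a) - logp_behavior_hat(tuple(history), s, a)
            history.append((s, a))
        weights.append(np.exp(log_w))
    return np.asarray(weights)
```

The off-policy value estimate is then the (optionally self-normalized) average of these weights multiplied by the corresponding trajectory returns.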

Actively choosing the data-generating policy to minimize estimator variance (behavioral policy optimization) further amplifies gains in sample efficiency, as in "Policy Gradient with Active Importance Sampling" (Papini et al., 9 May 2024), where the behavior density is iteratively updated via cross-entropy minimization to

$$p^*(\tau) = \frac{p_\theta(\tau)\, \|g(\tau)\|_2}{\int p_\theta(\tau')\, \|g(\tau')\|_2\, d\tau'}$$

with $g(\tau)$ denoting the per-trajectory policy gradient.
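
A single cross-entropy step toward this target density can be sketched as a weighted maximum-likelihood update; the `weighted_mle_fit` hook and the log-probability inputs below are hypothetical stand-ins for a concrete parametric behavior-policy class.

```python
import numpy as np

def cross_entropy_behavior_step(trajectories, grad_norms, logp_target,
                                logp_behavior, weighted_mle_fit):
    """One cross-entropy step toward p*(tau) proportional to p_theta(tau) * ||g(tau)||_2.

    trajectories     : trajectories sampled from the current behavior policy q
    grad_norms       : ||g(tau_i)||_2, the per-trajectory policy-gradient norms
    logp_target      : log p_theta(tau_i) under the target policy (array)
    logp_behavior    : log q(tau_i) under the current behavior policy (array)
    weighted_mle_fit : hypothetical routine fitting the behavior-policy parameters
                       by weighted maximum likelihood on the given trajectories
    """
    grad_norms = np.asarray(grad_norms, dtype=float)
    # Self-normalized importance weights proportional to p*(tau_i) / q(tau_i).
    log_w = (np.asarray(logp_target) - np.asarray(logp_behavior)
             + np.log(grad_norms + 1e-12))
    log_w -= log_w.max()                 # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    # Minimizing KL(p* || q_phi) over a parametric family reduces to weighted
    # maximum likelihood on the sampled trajectories with these weights.
    return weighted_mle_fit(trajectories, w)
```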

4. Cross-Policy Sampling in Adaptive and Ensemble Policy Selection

In domains with heterogeneous environments or nonuniform distributions of covariates, cross-policy sampling extends to adaptive policy selection frameworks. A prime example is the Prescribe-then-Select (PS) framework (Iglesias et al., 9 Sep 2025), which proceeds as:

  • Construct a library $\mathcal{P} = \{\pi_1, \dots, \pi_M\}$ of feasible candidate policies from different paradigms (e.g., sample average approximation, point prediction, scenario-based).
  • Learn a meta-policy via Optimal Policy Trees that maps a context $X$ to the index of a candidate policy, partitioning the covariate space into regions $R_j$ with associated assignments $\gamma_j$.
  • The meta-policy retains, with high probability, any improvement that a base candidate exhibits over a covariate region $R$, with a statistical guarantee whose strength scales with the probability mass of the region.

This design ensures that whenever heterogeneity exists, PS consistently outperforms fixed policies, and reduces to the best singleton policy in the absence of population heterogeneity.
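
The routing step can be sketched with a standard classification tree as a stand-in for the Optimal Policy Trees used in the PS framework; the cost matrix, the candidate `policies` list, and the depth hyperparameter below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_meta_policy(X, costs, max_depth=3):
    """Learn a context -> candidate-policy assignment (illustrative stand-in).

    X     : (n, d) array of context covariates
    costs : (n, M) array; costs[i, j] = realized cost of candidate policy j
            on training instance i (lower is better)
    """
    best_candidate = np.argmin(costs, axis=1)          # label = index of best policy
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, best_candidate)

def deploy(tree, policies, x_new):
    """Route a new context to the candidate policy chosen by the meta-policy."""
    j = int(tree.predict(np.asarray(x_new).reshape(1, -1))[0])
    return policies[j]
```

The leaves of the fitted tree play the role of the covariate regions $R_j$, with the predicted class index acting as the assignment $\gamma_j$.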

Cross-policy sampling also features prominently in simulation-based optimization for ranking and selection (Zhang et al., 2021), where resource allocation is dynamically aligned with estimated decision value impact. The AOAm policy adaptively maximizes the exponential decay rate of the false selection probability, in contrast with policies based on static allocation.
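
The dynamic flavor of such allocation rules can be conveyed with a simple sequential heuristic that directs each new simulation either to the current best alternative or to the competitor hardest to separate from it. This is only a generic illustration of adaptive budget allocation, not the AOAm policy itself; the `simulate` hook and the schedule are assumptions.

```python
import math
from statistics import mean, stdev

def sequential_allocation(simulate, k, n0=10, budget=500):
    """Generic sequential ranking-and-selection heuristic (illustrative, not AOAm).

    simulate(i) -> float : one noisy performance sample of alternative i
    k                    : number of alternatives
    n0                   : initial samples per alternative
    budget               : total simulation budget
    """
    samples = [[simulate(i) for _ in range(n0)] for i in range(k)]
    used = n0 * k
    while used < budget:
        means = [mean(s) for s in samples]
        best = max(range(k), key=lambda i: means[i])

        def ambiguity(i):
            # Small gap relative to standard error = hard to separate from best.
            gap = means[best] - means[i]
            se = stdev(samples[i]) / math.sqrt(len(samples[i])) + 1e-12
            return gap / se

        # Alternate between refining the incumbent best and probing the
        # competitor whose superiority/inferiority is most ambiguous.
        if used % 2 == 0:
            target = best
        else:
            target = min((i for i in range(k) if i != best), key=ambiguity)
        samples[target].append(simulate(target))
        used += 1

    means = [mean(s) for s in samples]
    return max(range(k), key=lambda i: means[i])   # selected alternative
```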

5. Empirical Validation and Efficiency Benefits

Empirical studies consistently demonstrate that cross-policy sampling methods achieve strong computational savings and improved policy performance across challenging domains:

  • Inverted pendulum and mountain-car environments: RSPI achieves policy performance commensurate with classic RCPI but requires up to an order of magnitude fewer rollouts, with the $U_{\textrm{ucba}}$ rule often emerging as the most efficient (0805.2027).
  • Decentralized POMDPs: DecRSPI produces near upper-bound quality joint policies in scenarios with up to 20 agents, exhibiting memory usage and time complexity that scale linearly with agent count (Wu et al., 2012).
  • Benchmark continuous control: Surrogate objectives based on high-confidence IS bounds (Metelli et al., 2018), as well as adaptive sampling in estimation (e.g., ZO-RL for black-box adversarial attack generation (Zhai et al., 2021)), yield faster convergence and lower estimator variance than naive on-policy or uniform sampling approaches.
  • Stochastic optimization: Adaptive meta-policy selection (Iglesias et al., 9 Sep 2025) reduces structural risk by leveraging data-driven context-region assignment, outperforming any fixed candidate policy in experimentally validated logistical planning tasks.

6. Theoretical Trade-offs and Limitations

Cross-policy sampling methods must balance several fundamental trade-offs:

  • Variance Reduction vs. Bias: Incorporating auxiliary information (e.g., extended history in IS behavior policy estimation) systematically decreases estimator variance but may introduce finite-sample bias, especially when the history-dependent parameterization is high-dimensional and the sample size is moderate (Zhou et al., 28 May 2025).
  • Computational Tractability: Adaptive allocation and bandit sampling, while highly efficient, necessitate bookkeeping infrastructure and sequential decision rules, with possible sensitivity to hyperparameters (e.g., confidence levels $\delta$, budget constraints, or early-stopping thresholds).
  • Model Class Limitation: Active behavior policy optimization, for instance via cross-entropy minimization (Papini et al., 9 May 2024), is limited by the expressiveness of the behavior policy class; the ideal optimal behavior density may be unattainable with the chosen parameterization.
  • Finite-Sample Effects: In OPE and meta-policy selection, overfitting and the inadequacy of asymptotic guarantees can erode performance; ensemble or cross-validated methods partially address this.

7. Broader Applications and Research Directions

Cross-policy sampling is widely employed in contexts where interaction, simulation, or data collection is expensive or subject to constraints:

  • Multi-agent systems: Scalable value-driven sampling (DecRSPI) allows distributed decision making when exhaustive enumeration of policies is infeasible.
  • Lifelong and meta-learning: By selecting or distilling across policy specialists, generalist performance is achieved in heterogeneous operational settings.
  • Healthcare and policy evaluation: End-to-end OPE frameworks using IS corrections and robust bounds underpin reliable cross-policy inference between trial and real-world deployment populations.
  • Simulation-based design and resource allocation: Large deviations analysis and dynamic programming–driven adaptive allocation strategies significantly improve the probability of correct selection in high-dimensional alternatives.

Active research questions include the optimal partitioning of simulation budgets in multi-objective or large-scale settings, adaptive trade-off design (bias-variance tuning), and robustness under model misspecification or nonstationary data distributions.


In sum, cross-policy sampling synthesizes techniques from adaptive resource allocation, importance weighting, meta-policy learning, and variance reduction. It underlies much of modern simulation-based RL and decision-making under uncertainty, offering powerful means to address efficiency, robustness, and scalability in complex, data-parsimonious environments.
