Counterfactual Risk Minimization (CRM)

Updated 18 March 2026

Counterfactual Risk Minimization (CRM) is a statistical learning approach that minimizes counterfactual risk by regularizing importance-weighted estimators to control high variance.
It applies algorithms like POEM, Bayesian CRM, and DRO-CRM to optimize policies for contextual bandits, learning-to-rank, structured prediction, and recommender systems.
CRM enhances empirical risk minimization by incorporating data-dependent generalization bounds and variance penalties, thereby improving safety and performance in off-policy scenarios.

Counterfactual Risk Minimization (CRM) is a statistical learning principle for learning policies from logged bandit feedback, where outcomes are observed only for the actions actually taken by a potentially biased historical policy. CRM derives data-dependent generalization bounds and regularizes the high variance of importance-weighted estimators, enabling reliable off-policy learning even from highly selective or non-i.i.d. interaction logs. The principle and its algorithmic instantiations (such as POEM, Bayesian CRM, and divergence- and exposure-regularized extensions) underpin state-of-the-art algorithms for contextual bandit, learning-to-rank, structured prediction, and recommender system problems.

1. Fundamentals of the CRM Framework

CRM formalizes learning from logged bandit feedback as follows. Given inputs $x \in \mathcal X$ and a (possibly huge) output space $\mathcal Y$, the aim is to learn a stochastic policy $h \in \mathcal H$ that minimizes the expected risk

$R(h) = \mathbb E_{x} \left[\,\mathbb E_{y \sim h(\cdot|x)}[\delta(x,y)]\,\right],$

where $\delta(x,y)$ is a (negative) reward or loss. Instead of full supervision, the learner observes a dataset $D = \{(x_i, y_i, \delta_i, p_i)\}_{i=1}^n$, where $y_i \sim h_0(\cdot|x_i)$ is drawn from a logging policy $h_0$ with known propensity $p_i = h_0(y_i|x_i)$, and only $\delta_i = \delta(x_i, y_i)$ is observed. This precludes ERM over the full action space and requires off-policy estimators.

Provided the logging policy assigns nonzero probability to every action that $h$ can select, an unbiased estimator of $R(h)$ is the inverse propensity scoring (IPS) estimator:

$\widehat R(h) = \frac{1}{n} \sum_{i=1}^n \delta_i \frac{h(y_i|x_i)}{p_i},$

but when some $p_i$ are small, the weights $h(y_i|x_i)/p_i$ can explode, yielding high or even unbounded variance. CRM addresses this by introducing data-dependent regularization that penalizes variance, producing robust learning objectives (Swaminathan et al., 2015).
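As a minimal sketch of why small propensities are a problem, the following snippet computes the IPS estimate on a toy log. The function name `ips_estimate` and all numbers are illustrative assumptions, not taken from the cited work.

```python
def ips_estimate(losses, target_probs, propensities):
    """IPS estimate of R(h): the mean of delta_i * h(y_i|x_i) / p_i."""
    n = len(losses)
    return sum(d * h / p for d, h, p in zip(losses, target_probs, propensities)) / n


# Hypothetical toy log (values invented for illustration).
losses = [-1.0, 0.0, -1.0, -1.0]       # delta_i (negative reward)
target_probs = [0.9, 0.5, 0.8, 0.6]    # h(y_i | x_i) under the candidate policy
propensities = [0.5, 0.5, 0.01, 0.3]   # p_i = h_0(y_i | x_i)

weights = [h / p for h, p in zip(target_probs, propensities)]
estimate = ips_estimate(losses, target_probs, propensities)

# The single sample with p_i = 0.01 carries a weight of about 80 and
# dominates the estimate, even though every loss lies in [-1, 0].
print(max(weights), estimate)
```

In this toy log the estimate falls far outside the range of the observed losses, which is the kind of artifact that variance regularization is meant to discourage.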

2. Theoretical Basis and Generalization Bounds

CRM is motivated by empirical Bernstein bounds, which provide uniform risk control based on both the mean and the empirical variance of importance-weighted losses. For clipped weights (with threshold $M$), define

$u^i_h = \delta_i \min\left\{M, \frac{h(y_i|x_i)}{p_i}\right\},$

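then the empirical mean $\overline{u_h} = \frac{1}{n}\sum_{i=1}^n u^i_h$ (which is the clipped risk estimate $\widehat R_M(h)$) and the empirical variance $\mathrm{Var}_h(u)$ control the generalization error. Specifically, with probability at least $1-\gamma$, the following bound holds uniformly over $h \in \mathcal H$:

$R(h) \leq \widehat R_M(h) + \sqrt{ \frac{18\,\mathrm{Var}_h(u)\,\mathcal Q(n,\gamma) } { n } } + M\, \frac{15\, \mathcal Q(n,\gamma)}{n-1},$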

where $\mathcal Q(n,\gamma)$ is a complexity term depending on the covering number of the (clipped, importance-weighted) loss class (Swaminathan et al., 2015). Dropping constants, this leads to the core CRM objective:

$h^* = \arg\min_{h \in \mathcal H} \left\{ \widehat R_M(h) + \lambda \sqrt{ \frac{ \mathrm{Var}_h(u)}{ n } } \right\},$

where $\lambda$ is a tunable regularization parameter. This penalizes high-variance policies, reducing overfitting to spurious estimation artifacts induced by rare actions or insufficient exploration.
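The objective can be evaluated directly from a log, as in the sketch below. It assumes the sample-variance convention with an $n-1$ denominator and at least two logged samples; the function name `crm_objective` and its argument names are illustrative, not from the cited papers.

```python
import math


def crm_objective(losses, target_probs, propensities, clip_m, lam):
    """Clipped IPS risk plus lam * sqrt(empirical variance / n)."""
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two samples to estimate the variance")
    # u_i = delta_i * min(M, h(y_i|x_i) / p_i)
    u = [d * min(clip_m, h / p)
         for d, h, p in zip(losses, target_probs, propensities)]
    mean_u = sum(u) / n
    # Assumption: unbiased sample variance (n - 1 denominator).
    var_u = sum((ui - mean_u) ** 2 for ui in u) / (n - 1)
    return mean_u + lam * math.sqrt(var_u / n)


# Hypothetical toy log: the third sample has a tiny propensity.
losses = [-1.0, 0.0, -1.0, -1.0]
target_probs = [0.9, 0.5, 0.8, 0.6]
propensities = [0.5, 0.5, 0.01, 0.3]

unpenalized = crm_objective(losses, target_probs, propensities, clip_m=10.0, lam=0.0)
penalized = crm_objective(losses, target_probs, propensities, clip_m=10.0, lam=1.0)
print(unpenalized, penalized)
```

With `lam=0` the function reduces to the clipped IPS estimate; larger values of `lam` raise the objective for policies whose weighted losses vary widely across the log.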

CRM generalizes empirical risk minimization to the off-policy regime and yields risk upper bounds that scale as $\mathcal O(\sqrt{ \mathrm{Var}_h(u)/ n })$, which are tighter in well-explored regions and conservative otherwise.

3. Algorithmic Realizations: POEM, Bayesian CRM, DRO

The Policy Optimizer for Exponential Models (POEM) instantiates CRM for policies parameterized as exponential families:

$h_w(y|x) = \frac{ \exp( w \cdot \phi(x, y) ) }{ Z_w(x) }.$

The clipped CRM objective becomes

$\min_{w} \left\{ \overline{u_w} + \lambda \sqrt{ \mathrm{Var}_w(u)/ n } \right\},$

where

$u^i_w = \delta_i \min\left\{ M,\; \frac{ \exp( w \cdot \phi(x_i, y_i) ) }{ p_i Z_w(x_i) } \right\}.$

To address nonconvexity and cross-example coupling in the variance term, POEM uses a majorization-minimization (MM) surrogate that decouples the variance, enabling efficient stochastic optimization (Swaminathan et al., 2015).
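The sketch below evaluates the clipped objective for a linear softmax policy. It only computes the objective value and does not implement the MM surrogate or any optimizer. It also assumes the candidate actions for each context can be enumerated so that $Z_w(x)$ is computed exactly; the log layout and the names `softmax_policy` and `poem_objective` are hypothetical.

```python
import math


def softmax_policy(w, action_features):
    """Return h_w(.|x) over candidates; action_features[k] is phi(x, y_k)."""
    scores = [sum(wj * fj for wj, fj in zip(w, phi)) for phi in action_features]
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def poem_objective(w, log, clip_m, lam):
    """log: list of (action_features, chosen_index, loss, propensity)."""
    u = []
    for action_features, chosen, loss, propensity in log:
        h = softmax_policy(w, action_features)[chosen]
        u.append(loss * min(clip_m, h / propensity))
    n = len(u)
    mean_u = sum(u) / n
    var_u = sum((ui - mean_u) ** 2 for ui in u) / (n - 1)  # needs n >= 2
    return mean_u + lam * math.sqrt(var_u / n)


# Hypothetical two-action problem with one-hot action features.
features = [[1.0, 0.0], [0.0, 1.0]]
log = [(features, 0, -1.0, 0.5), (features, 1, 0.0, 0.5)]
print(poem_objective([0.0, 0.0], log, clip_m=10.0, lam=1.0))
print(poem_objective([2.0, 0.0], log, clip_m=10.0, lam=1.0))
```

Because the variance term couples all examples, a plain per-example stochastic gradient of this function is not available, which is the difficulty the MM surrogate is designed to address.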

Subsequent work places CRM in a broader statistical learning context. PAC-Bayesian CRM (London et al., 2018) introduces a generalization bound for truncated IPS risk using Gibbs posteriors, motivating "logging-policy regularization" (direct $L_2$ regularization toward the logging policy's parameters), which is reported to match or exceed variance regularization in practice while being computationally simpler.
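Logging-policy regularization can be sketched as a squared $L_2$ penalty on the distance between the policy parameters and those of the logging policy. The snippet assumes both policies share the same parameterization and that a truncated IPS risk is supplied as a callable; `risk_fn`, `w0`, and the function name are hypothetical and do not reproduce the cited paper's exact objective.

```python
def logging_regularized_objective(w, w0, risk_fn, lam):
    """Truncated IPS risk plus lam * ||w - w0||^2 (penalty toward w0)."""
    penalty = sum((a - b) ** 2 for a, b in zip(w, w0))
    return risk_fn(w) + lam * penalty


# Hypothetical stand-in for a truncated IPS risk; any callable of w works.
def toy_risk(w):
    return -0.5 * w[0]


w0 = [1.0, 0.0]                      # assumed logging-policy parameters
print(logging_regularized_objective([1.0, 0.0], w0, toy_risk, lam=0.1))
print(logging_regularized_objective([3.0, 0.0], w0, toy_risk, lam=0.1))
```

Unlike the variance penalty, this term decomposes over parameters rather than over examples, so it adds no cross-example coupling to the objective.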

Distributionally Robust CRM (DRO-CRM) (Faury et al., 2019) interprets CRM variance penalization as a special case of DRO with $\chi^2$ uncertainty sets. More generally, using a Kullback-Leibler (KL) ball yields a new robust CRM objective:

$R_{\text{DRO}}^{\text{KL}}(\theta;\epsilon) = \inf_{\gamma > 0} \left\{ \gamma \epsilon + \gamma \log \mathbb E_{P_n}[\exp( \ell(\xi; \theta)/\gamma ) ] \right\},$

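where $P_n$ is the empirical distribution of the logged samples $\xi$, $\ell(\xi;\theta)$ is the per-sample loss, and $\epsilon$ is the radius of the KL ball. This objective reweights samples exponentially in their loss, focusing learning on high-variance or rare events and providing advantages in small-sample or poorly explored regimes.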
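A minimal numerical sketch of the KL dual follows. It assumes a crude logarithmic grid over $\gamma$ in place of a proper one-dimensional convex solver, and the function names are illustrative. The exponential weights shown are the standard form of the worst-case reweighting for a given $\gamma$.

```python
import math


def kl_dro_dual(losses, gamma, epsilon):
    """gamma * epsilon + gamma * log mean(exp(loss / gamma)), stabilized."""
    n = len(losses)
    top = max(losses)
    log_mean_exp = top / gamma + math.log(
        sum(math.exp((l - top) / gamma) for l in losses) / n
    )
    return gamma * epsilon + gamma * log_mean_exp


def kl_dro_risk(losses, epsilon):
    """Approximate the infimum over gamma > 0 on a log-spaced grid."""
    grid = [10.0 ** (k / 20.0) for k in range(-80, 81)]   # 1e-4 .. 1e4
    return min(kl_dro_dual(losses, g, epsilon) for g in grid)


def exponential_weights(losses, gamma):
    """Sample weights proportional to exp(loss / gamma)."""
    top = max(losses)
    exps = [math.exp((l - top) / gamma) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical per-sample losses: one rare, costly event.
losses = [0.0, 0.0, 0.0, 1.0]
print(sum(losses) / len(losses), kl_dro_risk(losses, epsilon=0.1))
print(exponential_weights(losses, gamma=0.5))
```

The robust risk lies between the empirical mean and the maximum loss, moving toward the maximum as `epsilon` grows.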

Recent VRCRM approaches (Bakker et al., 2024) use $f$-divergence penalties, but empirical studies suggest that direct divergence approximation is simpler and more robust than adversarial discriminator-based lower bounds.

4. Extensions: Learning-to-Rank, Structured Prediction, Recommender Systems

CRM supports a wide range of structures and application domains. For learning-to-rank, CRM with exposure-based risk regularization (Gupta et al., 2023) penalizes the Rényi divergence between the document exposure distributions induced by the learned and logging policies, which is computationally feasible in large permutation spaces where action-level divergences are impractical. The resulting objective balances the expected IPS loss against the divergence regularizer, providing theoretical bounds on variance and high-confidence safety for deployments.

In recommender systems, CRM is instantiated via IPS-weighted pairwise ranking losses (such as Bayesian Personalized Ranking (BPR)), often with a propensity regularizer (PR) penalizing high-magnitude weights to control variance. Self-normalized IPS (SNIPS) is used for evaluation to reduce estimator variance, with practical guidance on tuning, effective sample-size monitoring, and diagnostics for real-world adoption (Raja et al., 30 Aug 2025).
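The evaluation-side diagnostics mentioned above can be sketched as follows. The snippet assumes Kish's formula for effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$; the function names and toy values are illustrative.

```python
def importance_weights(target_probs, propensities):
    """w_i = h(y_i|x_i) / p_i for each logged sample."""
    return [h / p for h, p in zip(target_probs, propensities)]


def snips_estimate(losses, target_probs, propensities):
    """Self-normalized IPS: sum(w_i * delta_i) / sum(w_i)."""
    w = importance_weights(target_probs, propensities)
    return sum(wi * d for wi, d in zip(w, losses)) / sum(w)


def effective_sample_size(target_probs, propensities):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = importance_weights(target_probs, propensities)
    return sum(w) ** 2 / sum(wi * wi for wi in w)


# Hypothetical toy log with one very small propensity.
losses = [-1.0, 0.0, -1.0, -1.0]
target_probs = [0.9, 0.5, 0.8, 0.6]
propensities = [0.5, 0.5, 0.01, 0.3]

print(snips_estimate(losses, target_probs, propensities))
print(effective_sample_size(target_probs, propensities))
```

SNIPS always returns a value within the range of the observed losses, and an effective sample size far below the number of logged samples signals that a few weights dominate the estimate.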

Structured output prediction benefits from CRM in settings where the action or label space is combinatorial (e.g., multi-label classification), with POEM and KL-CRM yielding strong empirical results for test loss and argmax accuracy over standard baselines (Swaminathan et al., 2015, Faury et al., 2019, London et al., 2018).

5. Advanced Variants: Sequential CRM, Continuous Actions, Representation Learning

Sequential CRM (SCRM) (Zenati et al., 2023) generalizes CRM to settings where multiple policy deployments and logged data collection rounds are possible. SCRM constructs CRM upper bounds on the risk at each iteration, using variance-controlled estimators such as IPS-IX, and achieves accelerated excess risk and regret rates via restart strategies analogous to those in accelerated convex optimization.

For continuous action spaces (as in personalized pricing and bidding), CRM employs density-based importance weighting, with joint kernel embeddings of contexts and actions to model the target and logging policy distributions. Variance-reducing techniques include kernel ridge-regression smoothing of density ratios and smooth (differentiable) weight clipping. Proximal-point optimization further improves convergence on the nonconvex objective. Self-normalized estimators and diagnostics based on effective sample size underpin reliable offline model selection in these settings (Zenati et al., 2020).

CRM has also been adapted for representation learning and causal robustness, as in Counterfactual Adversarial Training (CAT), where CRM is used to dynamically re-weight sample losses between original and counterfactual (interpolated) latent representations, driving the model toward causal feature discovery in the presence of spurious correlations (Wang et al., 2021).

6. Empirical Evidence and Practical Impact

Empirical evaluations across multi-label classification, learning-to-rank, and recommender benchmarks report that CRM-based methods outperform naïve IPS, ERM, and direct $L_2$ regularization. POEM and CRM-derived algorithms yield statistically significant gains in Hamming loss, NDCG@k, argmax accuracy, and robustness, especially in the small-data regime or when logging policies provide selective, biased exposure. CRM-based methods also exhibit efficient convergence, stable offline policy evaluation, and strong generalization. Representative results:

Method (dataset)	Task	Logging policy	IPS	POEM / CRM	Oracle
POEM (Scene)	Multi-label classification	1.529	1.157	1.128	0.646
Exposure-CRM (Y! Webscope)	Learning-to-rank	0.677	0.659	0.677	0.727

(Swaminathan et al., 2015, Gupta et al., 2023, Raja et al., 30 Aug 2025)

Furthermore, CRM-based safe deployment in learning-to-rank and sequential policies effectively eliminates catastrophic “cold start” performance dips, accelerating attainment of optimal reward as compared to IPS-only or non-regularized approaches (Gupta et al., 2023, Zenati et al., 2023).

7. Limitations, Open Problems, and Outlook

Although CRM systematically controls variance in off-policy learning, several practical and theoretical questions remain. In large or continuous action spaces, accurate propensity estimation, sufficient exploration, and diagnostic tools such as effective sample size remain critical. Adversarial and $f$-GAN-based divergence regularization can be unstable or difficult to tune, making direct divergence control preferable (Bakker et al., 2024). Extensions to reinforcement learning, robust policy selection, and causal representation learning continue to be active research areas.

A plausible implication is that CRM provides a foundation for robust off-policy learning across complex domains, with ongoing improvements focused on tighter bounds, computational efficiency, scalable variance surrogates, and adaptation to nonstationary or real-world, large-scale feedback scenarios.