Empirical Soft Regret (ESR)
- Empirical Soft Regret (ESR) is a surrogate loss function designed for binary decision-making under bandit feedback, directly targeting reduction in decision regret.
- It smooths a non-differentiable indicator using a logistic function, enabling gradient-based optimization with flexible models like neural networks.
- Empirical results on recommendation and causal inference benchmarks demonstrate ESR's ability to achieve lower regret and higher click-through rates compared to traditional methods.
Empirical Soft Regret (ESR) is a surrogate loss function for the predict-then-optimize paradigm in binary decision-making under bandit feedback. ESR directly targets reduction in decision regret, distinguishing itself from classical mean-squared error (MSE) and related approaches by being specifically designed for settings where only observed rewards for chosen actions (not counterfactual outcomes) are available. The ESR loss is constructed to be differentiable, enabling the training of highly flexible parametric models, including neural networks, through gradient-based optimization. ESR is theoretically justified under paired-data idealizations and empirically validated on benchmarks in recommendation and causal inference, where it achieves lower regret than state-of-the-art baselines (Tan et al., 2024).
1. Predict-then-Optimize Setup and Regret Formulation
In the predict-then-optimize framework, a practitioner observes historical data consisting of triplets $(x_i, a_i, y_i)$, where $x_i \in \mathcal{X}$ is a context, $a_i \in \{0,1\}$ is a binary action, and $y_i$ is the realized reward from an unknown reward function $f(x, a)$. For each new context $x$, the goal is to select the action $a \in \{0,1\}$ that maximizes reward.
A parametric model $f_\theta(x, a)$ is trained to predict outcomes. The induced policy is

$$\pi_\theta(x) = \arg\max_{a \in \{0,1\}} f_\theta(x, a),$$

and the regret at context $x$ is

$$R(x) = \max_{a \in \{0,1\}} f(x, a) - f\bigl(x, \pi_\theta(x)\bigr).$$

The objective is to minimize expected regret $\mathbb{E}_x[R(x)]$, rather than pointwise prediction error.
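As a concrete illustration, the induced policy and per-context regret can be sketched as follows (a minimal NumPy sketch; `f_true`, `f_model`, and the noise level are hypothetical stand-ins, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth reward f(x, a), for illustration only:
# action 1 is better whenever the context mean exceeds 0.1.
def f_true(x, a):
    return a * x.mean() + (1 - a) * 0.1

# A stand-in "model": a noisy copy of the truth, to keep the sketch short.
def f_model(x, a):
    return f_true(x, a) + 0.05 * rng.standard_normal()

def induced_policy(f, x):
    # pi(x) = argmax_a f(x, a) over the binary action set {0, 1}
    return int(f(x, 1) > f(x, 0))

def regret(x):
    # R(x) = max_a f*(x, a) - f*(x, pi(x)): reward lost by following the model
    a_hat = induced_policy(f_model, x)
    return max(f_true(x, 0), f_true(x, 1)) - f_true(x, a_hat)

xs = rng.standard_normal((1000, 5))
avg_regret = np.mean([regret(x) for x in xs])
print(f"average regret: {avg_regret:.4f}")  # typically small: errors occur near the boundary
```

Note that regret is nonnegative by construction and vanishes wherever the model's decision matches the truth, even if its predicted values are wrong; this is exactly the gap between decision quality and pointwise accuracy that motivates ESR.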
2. ESR Loss Definition and Surrogate Construction
The canonical “hard” empirical regret for a datapoint $(x_i, a_i, y_i)$ and its counterfactual is

$$\hat{R}_i = |\delta_i| \cdot \mathbb{1}\bigl\{\delta_i \, \Delta_\theta(x_i) < 0\bigr\},$$

where $\delta_i = y_i^{(1)} - y_i^{(0)}$ is the true reward gap and $\Delta_\theta(x) = f_\theta(x, 1) - f_\theta(x, 0)$ is the predicted gap. The indicator function is not differentiable with respect to $\theta$.

ESR replaces the non-differentiable indicator with a smooth logistic surrogate. For smoothing parameter $\lambda > 0$:

$$\tilde{R}_i = \frac{|\delta_i|}{1 + \exp\bigl(\delta_i \, \Delta_\theta(x_i) / \lambda\bigr)}.$$

In practical bandit feedback, both $y_i^{(1)}$ and $y_i^{(0)}$ may not be observed for each $x_i$. ESR approximates counterfactuals via nearest-neighbor pairing across actions:

$$j(i) = \arg\min_{j \,:\, a_j = 1 - a_i} \|x_j - x_i\|, \qquad \hat{\delta}_i = \begin{cases} y_i - y_{j(i)}, & a_i = 1,\\ y_{j(i)} - y_i, & a_i = 0,\end{cases}$$

yielding the empirical soft regret loss:

$$\mathcal{L}_{\mathrm{ESR}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{\delta}_i|}{1 + \exp\bigl(\hat{\delta}_i \, \Delta_\theta(x_i) / \lambda\bigr)}.$$
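The empirical soft regret loss can be sketched directly in NumPy (a minimal sketch; `esr_loss` and the sample gaps are illustrative, and it assumes the paired gaps $\hat{\delta}_i$ have already been constructed):

```python
import numpy as np

def esr_loss(delta_y, delta_theta, lam=0.1):
    """Empirical soft regret over paired (pseudo-)counterfactual gaps.

    delta_y:     y^(1) - y^(0), the reward gap per sample
    delta_theta: f_theta(x,1) - f_theta(x,0), the model's predicted gap
    lam:         smoothing parameter; the surrogate approaches hard regret as lam -> 0
    """
    # |delta_y| / (1 + exp(delta_y * delta_theta / lam)):
    # near |delta_y| when predicted and true gaps disagree in sign, near 0 otherwise
    return np.mean(np.abs(delta_y) / (1.0 + np.exp(delta_y * delta_theta / lam)))

delta_y = np.array([1.0, -0.5, 0.8])
good = esr_loss(delta_y, np.array([2.0, -1.0, 1.5]), lam=0.01)   # signs all agree
bad = esr_loss(delta_y, np.array([-2.0, 1.0, -1.5]), lam=0.01)   # signs all disagree
print(good, bad)
```

With a small $\lambda$, the loss is near zero when every predicted gap has the correct sign, and near the mean absolute gap when every sign is wrong, matching the hard-regret limit.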
3. Derivation, Differentiability, and Gradient Structure
As $\lambda \to 0$, ESR recovers the hard regret: the denominator becomes $1$ where the signs of $\hat{\delta}_i$ and $\Delta_\theta(x_i)$ disagree (incorrect decision) and diverges otherwise (correct decision), sending the per-sample loss to $|\hat{\delta}_i|$ or $0$, respectively. The loss is differentiable with respect to model parameters:

$$\nabla_\theta \mathcal{L}_{\mathrm{ESR}} = -\frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{\delta}_i| \, \hat{\delta}_i}{\lambda} \, s_i (1 - s_i) \, \nabla_\theta \Delta_\theta(x_i),$$

where $s_i = \sigma\bigl(-\hat{\delta}_i \, \Delta_\theta(x_i)/\lambda\bigr)$ and $\sigma(z) = 1/(1 + e^{-z})$ is the logistic sigmoid. This direct differentiability makes ESR amenable to modern autodiff and optimizer frameworks.
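The closed-form gradient can be checked against a finite-difference estimate, a useful sanity test when implementing the surrogate by hand (a minimal sketch; function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_regret(delta_y, delta_theta, lam):
    # per-sample surrogate: |delta_y| * sigmoid(-delta_y * delta_theta / lam)
    return np.abs(delta_y) * sigmoid(-delta_y * delta_theta / lam)

def soft_regret_grad(delta_y, delta_theta, lam):
    # analytic derivative with respect to the predicted gap delta_theta
    s = sigmoid(-delta_y * delta_theta / lam)
    return -(np.abs(delta_y) * delta_y / lam) * s * (1.0 - s)

# Finite-difference check that the closed form matches the true derivative
dy, dt, lam, eps = 0.8, -0.3, 0.5, 1e-6
fd = (soft_regret(dy, dt + eps, lam) - soft_regret(dy, dt - eps, lam)) / (2 * eps)
an = soft_regret_grad(dy, dt, lam)
print(fd, an)
```

The chain rule then propagates this scalar factor through $\nabla_\theta \Delta_\theta(x_i)$, which an autodiff framework handles automatically.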
4. Training Implementation
Training with ESR involves the following steps for each epoch:
- Precompute neighbor indices $j(i)$ using a KD-tree or approximate nearest neighbors in context space,
- For each minibatch, compute for each sample $i$:
- $\Delta_\theta(x_i) = f_\theta(x_i, 1) - f_\theta(x_i, 0)$,
- $\hat{\delta}_i$, the pseudo-gap oriented as $y_i^{(1)} - y_i^{(0)}$ (observed reward minus matched neighbor's reward, with the sign flipped when $a_i = 0$),
- $\ell_i = |\hat{\delta}_i| \, \sigma\bigl(-\hat{\delta}_i \, \Delta_\theta(x_i)/\lambda\bigr)$,
- Batch loss is $\frac{1}{|B|} \sum_{i \in B} \ell_i$,
- Backpropagate gradient using standard optimizers (e.g., Adam).
Hyperparameters include the smoothing parameter $\lambda$ and the learning rate. For numerical stability, the argument of the exponential should be clipped to a bounded range. The nearest-neighbor search requires a one-time preprocessing pass over the dataset.
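The steps above can be combined into a self-contained training sketch on synthetic bandit data (brute-force neighbor matching and a linear gap model stand in for a KD-tree and a neural network; all data, names, and constants here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bandit log: contexts X, logged binary actions A, observed rewards Y.
# Illustrative ground truth: action 1 pays x[0], action 0 pays -x[0].
n, d = 400, 3
X = rng.standard_normal((n, d))
A = rng.integers(0, 2, size=n)
Y = np.where(A == 1, X[:, 0], -X[:, 0]) + 0.1 * rng.standard_normal(n)

# Step 1: one-time nearest-neighbor pairing across actions
# (brute force here; a KD-tree or ANN index replaces this at scale).
def match_counterfactuals(X, A):
    j = np.empty(len(X), dtype=int)
    for i in range(len(X)):
        other = np.flatnonzero(A != A[i])
        j[i] = other[np.argmin(np.linalg.norm(X[other] - X[i], axis=1))]
    return j

j = match_counterfactuals(X, A)
# Pseudo-gap oriented as y^(1) - y^(0) for every sample
delta_y = np.where(A == 1, Y - Y[j], Y[j] - Y)

# Remaining steps: minibatch gradient descent on a linear gap model,
# delta_theta(x) = x @ w, standing in for f_theta(x,1) - f_theta(x,0).
lam, lr = 0.1, 0.1
w = np.zeros(d)
for epoch in range(200):
    for idx in np.array_split(rng.permutation(n), 4):
        dt = X[idx] @ w
        z = np.clip(-delta_y[idx] * dt / lam, -50.0, 50.0)  # clipped for stability
        s = 1.0 / (1.0 + np.exp(-z))                        # sigmoid(-delta_y*dt/lam)
        # Gradient of mean_i |delta_y_i| * s_i with respect to w
        g = -(np.abs(delta_y[idx]) * delta_y[idx] / lam) * s * (1.0 - s)
        w -= lr * (g @ X[idx]) / len(idx)

# The learned policy sign(X @ w) should favor action 1 roughly when x[0] > 0
acc = np.mean((X @ w > 0) == (X[:, 0] > 0))
print(f"decision agreement with ground truth: {acc:.2f}")
```

A plain optimizer loop is used here for transparency; in practice the same loss drops into any autodiff framework with Adam, as the section describes.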
5. Theoretical Properties: Asymptotic Optimality
Under the idealization where both actions are observed for each context (paired data), and assuming:
- Bounded action gap: $|f(x,1) - f(x,0)| \le \Delta_{\max}$ for all $x$,
- The model class has a controlled covering number,
- The predicted difference $\Delta_\theta(x)$ does not concentrate near zero too rapidly,
one establishes, for $\lambda = \lambda_n \to 0$ at a suitable rate and regularity on the function class, that the expected regret of the ESR minimizer $\hat{\theta}$ converges to that of the regret-optimal model in the class, with probability approaching one at an exponential rate. The proof builds on uniform convergence of the soft surrogate to hard regret and balances statistical complexity with the smoothing bias via the choice of $\lambda$.
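The shape of the argument follows a standard surrogate-risk decomposition (a sketch under the paired-data idealization, not the paper's exact bound), writing $\mathcal{L}_\lambda$ for the soft surrogate risk, $R$ for expected hard regret, and $\theta^\ast$ for the regret-optimal model:

```latex
R(\hat{\theta}) - R(\theta^\ast)
  = \bigl[R(\hat{\theta}) - \mathcal{L}_\lambda(\hat{\theta})\bigr]
  + \underbrace{\bigl[\mathcal{L}_\lambda(\hat{\theta}) - \mathcal{L}_\lambda(\theta^\ast)\bigr]}_{\le\,0
      \text{ by optimality of } \hat{\theta}}
  + \bigl[\mathcal{L}_\lambda(\theta^\ast) - R(\theta^\ast)\bigr]
  \;\le\; 2 \sup_{\theta}\bigl|\mathcal{L}_\lambda(\theta) - R(\theta)\bigr|,
```

so it suffices to control the uniform deviation between soft and hard regret, which shrinks as $\lambda \to 0$ at a rate traded off against the complexity of the model class.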
6. Empirical Results and Benchmarks
ESR has been evaluated on two distinct benchmark domains:
IHDP semi-synthetic CATE Benchmark (25 covariates):
- Baselines: S-learner, T-learner, R-learner, DR-learner using two-layer neural nets.
- Test-set regret (95% CIs over 1000 runs):
| Method | Test Regret (95% CI) |
|---|---|
| S-learner | [1.02, 1.36] |
| T-learner | [0.70, 0.97] |
| R-learner | [2.81, 3.19] |
| DR-learner | [0.77, 1.21] |
| ESR | [0.35, 0.43] |

Yahoo! R6A News Recommendation (20 articles):
- Binary reduction by sampling article pairs.
- Off-policy evaluation via naive inverse propensity scoring (IPS); baseline methods as above.
- Estimated click-through rate (95% CIs over 10 days):
| Method | CTR (95% CI) |
|---|---|
| ESR | [4.11%, 4.45%] |
| Direct (MSE) | [3.72%, 4.04%] |
| T-learner | [3.57%, 3.84%] |
| R-learner | [3.56%, 3.84%] |
| DR-learner | [3.50%, 3.79%] |
In both settings, ESR outperforms established baselines with statistical significance.
7. Implementation Notes, Scope, and Limitations
Recommended practice is to choose the smoothing parameter $\lambda$ small enough to approximate hard regret while keeping gradients informative; standard optimizers and learning rates are effective. Nearest-neighbor retrieval in context space is critical to the construction of pseudo-counterfactual pairs and directly affects regret-estimation accuracy; high data sparsity can limit performance. The main computational overhead is the one-time neighbor search; subsequent training is standard.
ESR is most beneficial when counterfactual outcomes are absent and classical MSE is insufficient due to context-dependent reward shifts. It enables direct minimization of regret rather than pointwise error, providing a performance advantage in decision-focused scenarios such as bandit feedback, policy learning, and individualized treatment effect estimation.
Limitations include restriction to binary actions; generalization to multiple or continuous actions is an open area. The method presumes model capacity sufficient to learn the true reward gap structure. Nearest-neighbor quality depends on geometry and data density, introducing a tradeoff between approximation fidelity and computational cost.
For comprehensive mathematical development, empirical results, and implementation details, see (Tan et al., 2024).