
Empirical Soft Regret (ESR)

Updated 4 March 2026
  • Empirical Soft Regret (ESR) is a surrogate loss function designed for binary decision-making under bandit feedback, directly targeting reduction in decision regret.
  • It smooths a non-differentiable indicator using a logistic function, enabling gradient-based optimization with flexible models like neural networks.
  • Empirical results on recommendation and causal inference benchmarks demonstrate ESR's ability to achieve lower regret and higher click-through rates compared to traditional methods.

Empirical Soft Regret (ESR) is a surrogate loss function for the predict-then-optimize paradigm in binary decision-making under bandit feedback. ESR directly targets reduction in decision regret, distinguishing itself from classical mean-squared error (MSE) and related approaches by being specifically designed for settings where only observed rewards for chosen actions (not counterfactual outcomes) are available. The ESR loss is constructed to be differentiable, enabling the training of highly flexible parametric models, including neural networks, through gradient-based optimization. ESR is theoretically justified under paired-data idealizations and empirically validated on benchmarks in recommendation and causal inference, where it achieves lower regret than state-of-the-art baselines (Tan et al., 2024).

1. Predict-then-Optimize Setup and Regret Formulation

In the predict-then-optimize framework, a practitioner observes historical data consisting of triplets $(w_i, x_i, y_i)$, where $w_i \in \mathcal{W}$ is a context, $x_i \in \{0,1\}$ is a binary action, and $y_i = f(x_i, w_i)$ is the realized reward from an unknown function $f:\{0,1\} \times \mathcal{W} \to \mathbb{R}$. For each new context $w$, the goal is to select $x$ that maximizes reward.

A parametric model $\hat f_\theta(x, w)$ is trained to predict outcomes. The induced policy is

$$\pi_\theta(w) = \arg\max_{x \in \{0,1\}} \hat f_\theta(x, w)$$

and the regret at context $w$ is

$$R_\theta(w) = \max_{x \in \{0,1\}} f(x, w) - f(\pi_\theta(w), w).$$

The objective is to minimize the expected regret $\mathbb{E}_w[R_\theta(w)]$, rather than pointwise prediction error.
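This regret can be estimated in a few lines of numpy. The reward function `f` and the deliberately biased model `f_hat` below are illustrative stand-ins, not from the paper; they only show how the induced argmax policy is scored against the true best action.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Illustrative true reward: action 1 pays w[0]; action 0 pays a flat 0.1.
    return x * w[0] + (1 - x) * 0.1

def f_hat(x, w):
    # Stand-in fitted model with a deliberate bias toward action 1.
    return x * (w[0] + 0.2) + (1 - x) * 0.1

def regret(w):
    best = max(f(0, w), f(1, w))
    pi = 0 if f_hat(0, w) >= f_hat(1, w) else 1  # induced policy argmax
    return best - f(pi, w)

contexts = rng.normal(size=(1000, 5))
expected_regret = float(np.mean([regret(w) for w in contexts]))
print(f"estimated expected regret: {expected_regret:.4f}")
```

Note that the model's bias only costs regret on contexts near the decision boundary; elsewhere the wrong-by-0.2 predictions still pick the right action.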

2. ESR Loss Definition and Surrogate Construction

The canonical “hard” empirical regret for a datapoint $(w_i, x_i, y_i)$ and its counterfactual is

$$R_\theta(w_i) = \mathbf{1}\bigl\{\mathrm{sign}(\Delta f_i) \neq \mathrm{sign}(\Delta \hat f_i)\bigr\}\, |\Delta f_i|,$$

where $\Delta f_i = f(1, w_i) - f(0, w_i)$ and $\Delta \hat f_i = \hat f_\theta(1, w_i) - \hat f_\theta(0, w_i)$. The indicator function is not differentiable with respect to $\theta$.

ESR replaces the non-differentiable indicator with a smooth logistic surrogate. For smoothing parameter $k > 0$:

$$R'_{\theta, k}(w_i) = \frac{|\Delta f_i|}{1 + \exp\left(k\, \mathrm{sign}(\Delta f_i)\, \Delta \hat f_i\right)}.$$

Under practical bandit feedback, both actions are rarely observed for the same context $w_i$. ESR therefore approximates counterfactuals via nearest-neighbor pairing across actions:

$$\Delta f_{i, n(i)} = f(x_i, w_i) - f(x_{n(i)}, w_{n(i)}),$$

$$\Delta \hat f_{i, n(i)} = \hat f_\theta(x_i, w_i) - \hat f_\theta(x_{n(i)}, w_{n(i)}),$$

yielding the empirical soft regret loss:

$$L_{\mathrm{ESR}, k}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{|\Delta f_{i, n(i)}|}{1 + \exp\left(k\, \mathrm{sign}(\Delta f_{i, n(i)})\, \Delta \hat f_{i, n(i)}\right)}.$$
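A minimal sketch of the paired loss, under illustrative assumptions: a synthetic linear reward model, a linear score model `f_hat(x, w) = x * (w @ theta)`, and brute-force nearest-neighbor pairing in place of a KD-tree.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 400, 3, 25.0

W = rng.normal(size=(n, d))                       # logged contexts
X = rng.integers(0, 2, size=n)                    # logged binary actions
Y = X * W[:, 0] + rng.normal(scale=0.1, size=n)   # observed rewards (synthetic)

def nearest_opposite(i):
    # Brute-force nearest neighbor among points that took the other action
    # (a KD-tree would replace this at scale).
    idx = np.flatnonzero(X != X[i])
    return idx[np.argmin(np.linalg.norm(W[idx] - W[i], axis=1))]

def esr_loss(theta):
    # The pair (i, n(i)) stands in for the unobserved counterfactual at w_i.
    total = 0.0
    for i in range(n):
        j = nearest_opposite(i)
        d_f = Y[i] - Y[j]
        d_fhat = X[i] * (W[i] @ theta) - X[j] * (W[j] @ theta)
        s = np.sign(d_f) * d_fhat
        total += abs(d_f) / (1.0 + np.exp(np.clip(k * s, -50, 50)))
    return total / n

loss_good = esr_loss(np.array([1.0, 0.0, 0.0]))   # aligned with the true reward
loss_bad = esr_loss(np.array([-1.0, 0.0, 0.0]))   # sign-flipped parameters
print(loss_good, loss_bad)
```

Parameters whose predicted differences agree in sign with the observed gaps incur almost no loss; sign-flipped parameters pay nearly the full $|\Delta f|$ on every pair.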

3. Derivation, Differentiability, and Gradient Structure

As $k \to \infty$ with $w_{n(i)} \approx w_i$, ESR recovers the hard regret: the denominator tends to $1$ where the signs disagree (incorrect decision) and to $\infty$ otherwise (correct decision). The loss is differentiable with respect to the model parameters. Writing

$$\ell_i(\theta) = \frac{D_i}{1 + e^{k S_i}}, \quad D_i = |\Delta f_{i, n(i)}|, \quad S_i = \mathrm{sign}(\Delta f_{i, n(i)})\, \Delta \hat f_{i, n(i)},$$

the gradient is

$$\nabla_\theta \ell_i = -D_i\, \frac{k\, e^{k S_i}}{(1 + e^{k S_i})^2}\, \nabla_\theta S_i = -D_i\, k\, \sigma_k(S_i)\bigl(1 - \sigma_k(S_i)\bigr)\, \mathrm{sign}(\Delta f_{i, n(i)})\, \nabla_\theta \Bigl[\hat f_\theta(x_i, w_i) - \hat f_\theta(x_{n(i)}, w_{n(i)})\Bigr],$$

where $\sigma_k(s) = 1/(1 + e^{-k s})$. This direct differentiability makes ESR amenable to modern automatic-differentiation and optimizer frameworks.
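The closed-form gradient can be checked against central finite differences. The single pseudo-pair below is illustrative: a linear score model whose predicted difference is `dphi @ theta` for a feature-difference vector `dphi` (an assumption for the example, not notation from the paper).

```python
import numpy as np

k = 10.0

def sigma_k(s):
    return 1.0 / (1.0 + np.exp(-k * s))

# One pseudo-pair: D = |Δf|, sgn = sign(Δf); the predicted difference depends
# linearly on theta through dphi (illustrative values).
D, sgn = 0.7, 1.0
dphi = np.array([0.5, -0.3])

def loss(theta):
    s = sgn * (dphi @ theta)
    return D / (1.0 + np.exp(k * s))

def grad(theta):
    # Closed-form gradient from Section 3.
    s = sgn * (dphi @ theta)
    return -D * k * sigma_k(s) * (1.0 - sigma_k(s)) * sgn * dphi

theta = np.array([0.2, 0.1])
eps = 1e-6
numerical = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                      for e in np.eye(2)])
err = float(np.max(np.abs(grad(theta) - numerical)))
print(f"max |analytic - numerical| = {err:.2e}")
```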

4. Training Implementation

Training with ESR involves the following steps for each epoch:

  • Precompute neighbor indices $n(i)$ using a KD-tree or approximate nearest-neighbor search in context space.
  • For each minibatch, for each $i$ compute:
    • $j = n(i)$,
    • $D_i = |y_i - y_j|$,
    • $S_i = \mathrm{sign}(y_i - y_j)\,[\hat f_\theta(x_i, w_i) - \hat f_\theta(x_j, w_j)]$.
  • The batch loss is $L = (1/B) \sum_{i \in I} D_i / [1 + \exp(k S_i)]$.
  • Backpropagate $\nabla_\theta L$ using a standard optimizer (e.g., Adam).

Hyperparameters include the smoothing parameter $k$ (typical values in $[5, 100]$) and the learning rate. For numerical stability, arguments to $\exp$ should be clipped to $[-50, 50]$. The nearest-neighbor search requires $O(n \log n)$ preprocessing.
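The steps above can be sketched end-to-end for a linear score model with the gradient coded by hand (an illustrative setup; the paper trains neural networks via autodiff, and the brute-force neighbor search stands in for a KD-tree):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k, lr = 500, 4, 25.0, 0.05

W = rng.normal(size=(n, d))
X = rng.integers(0, 2, size=n)
beta = np.array([1.0, -0.5, 0.0, 0.0])                 # synthetic true effect
Y = X * (W @ beta) + rng.normal(scale=0.05, size=n)

# One-time precomputation of nearest opposite-action neighbors.
nbr = np.empty(n, dtype=int)
for i in range(n):
    idx = np.flatnonzero(X != X[i])
    nbr[i] = idx[np.argmin(np.linalg.norm(W[idx] - W[i], axis=1))]

def sigma(s):
    return 1.0 / (1.0 + np.exp(-np.clip(s, -50.0, 50.0)))

theta = np.zeros(d)  # linear score model: f_hat(x, w) = x * (w @ theta)
for epoch in range(300):
    D = np.abs(Y - Y[nbr])
    sgn = np.sign(Y - Y[nbr])
    dphi = X[:, None] * W - X[nbr][:, None] * W[nbr]   # grad of Δf̂ wrt theta
    S = sgn * (dphi @ theta)
    # Per-pair gradient coefficient from the formula in Section 3.
    coef = -D * k * sigma(k * S) * (1.0 - sigma(k * S)) * sgn
    theta -= lr * (coef[:, None] * dphi).mean(axis=0)  # full-batch step

print(theta)  # expected to point roughly along beta's direction
```

Plain gradient descent suffices here because the loss saturates once $k S_i$ is large; Adam and minibatching matter for the neural-network case.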

5. Theoretical Properties: Asymptotic Optimality

Under the idealization where both actions are observed for each context (paired data), and assuming:

  • Bounded action gap: $\sup_w |f(1, w) - f(0, w)| < \infty$,
  • the model class has covering number $N(\alpha) = O(\alpha^{-2})$,
  • the predicted difference $\Delta \hat f_\theta(w)$ does not concentrate near zero too rapidly,

one establishes, for $k(n) \ge n^{1/4} \log n$ and under regularity conditions on the function class, that

$$\mathbb{E}[R_{\hat\theta_{\mathrm{ESR}, n}}(w)] - \mathbb{E}[R_{\theta^*}(w)] \xrightarrow[n \to \infty]{} 0$$

in probability at an exponential rate, where $\theta^* = \arg\min_\theta \mathbb{E}[R_\theta(w)]$ and $\hat\theta_{\mathrm{ESR}, n}$ minimizes $L_{\mathrm{ESR}, k(n)}(\theta)$. The proof builds on uniform convergence of the soft surrogate to the hard regret and balances statistical complexity against smoothing bias via the choice of $k$.
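The convergence of the soft surrogate to the hard regret can be illustrated numerically (values below are illustrative). The grid deliberately excludes $\Delta \hat f = 0$, where the surrogate stays at $|\Delta f|/2$ for every $k$; this is exactly why the theorem requires that predicted differences not concentrate near zero.

```python
import numpy as np

# Fixed true gap Δf = 1; predicted differences on a grid away from zero.
d_f = 1.0
d_fhat = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
hard = np.where(np.sign(d_fhat) != np.sign(d_f), abs(d_f), 0.0)

gaps = []
for k in [1.0, 10.0, 100.0]:
    soft = abs(d_f) / (1.0 + np.exp(np.clip(k * np.sign(d_f) * d_fhat, -50, 50)))
    gaps.append(float(np.max(np.abs(soft - hard))))
print(gaps)  # worst-case gap over the grid shrinks as k grows
```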

6. Empirical Results and Benchmarks

ESR has been evaluated on two distinct benchmark domains:

  • IHDP semi-synthetic CATE benchmark ($n \approx 747$, 25 covariates):

    • Baselines: S-learner, T-learner, R-learner, DR-learner using two-layer neural nets.
    • Test-set regret (95% CIs over 1000 runs):

      Method        Test Regret (95% CI)
      S-learner     [1.02, 1.36]
      T-learner     [0.70, 0.97]
      R-learner     [2.81, 3.19]
      DR-learner    [0.77, 1.21]
      ESR ($k=25$)  [0.35, 0.43]
  • Yahoo! R6A news recommendation ($\approx 45$M impressions, 20 articles):

    • Binary reduction by sampling article pairs.
    • Off-policy evaluation via naive IPS; baseline methods as above.
    • Estimated click-through rate (95% CIs over 10 days):

      Method         CTR (95% CI)
      ESR            [4.11%, 4.45%]
      Direct (MSE)   [3.72%, 4.04%]
      T-learner      [3.57%, 3.84%]
      R-learner      [3.56%, 3.84%]
      DR-learner     [3.50%, 3.79%]

In both settings, ESR outperforms established baselines with statistical significance.

7. Implementation Notes, Scope, and Limitations

Recommended settings for ESR include smoothing $k \approx 25$ and learning rates around $10^{-3}$. Standard optimizers are effective. Nearest-neighbor retrieval in $\mathcal{W}$ is critical to the construction of pseudo-counterfactual pairs and directly affects regret-estimation accuracy; high data sparsity can limit performance. The main computational overhead is the one-time neighbor search; subsequent training is standard.
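The clipping recommendation can be seen directly: without it, the saturating branch of the gradient coefficient evaluates to $\infty/\infty$. The single pair below uses illustrative values.

```python
import numpy as np

k, S, D = 100.0, 10.0, 1.0  # k * S = 1000 exceeds float64 exp's ~709 limit

with np.errstate(over="ignore", invalid="ignore"):
    e = np.exp(k * S)                        # overflows to inf
    grad_coef = D * k * e / (1.0 + e) ** 2   # inf / inf -> nan: gradients break

e_safe = np.exp(np.clip(k * S, -50.0, 50.0))
grad_safe = D * k * e_safe / (1.0 + e_safe) ** 2  # underflows smoothly toward 0

print(np.isnan(grad_coef), grad_safe)
```

The clipped version is benign because a confidently correct pair should contribute a vanishing, not undefined, gradient.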

ESR is most beneficial when counterfactual outcomes are absent and classical MSE is insufficient due to context-dependent reward shifts. It enables direct minimization of regret rather than pointwise error, providing a performance advantage in decision-focused scenarios such as bandit feedback, policy learning, and individualized treatment effect estimation.

Limitations include the restriction to binary actions; generalization to multiple or continuous actions remains open. The method presumes model capacity sufficient to learn the true reward-gap structure. Nearest-neighbor quality depends on the geometry of $\mathcal{W}$ and on data density, introducing a tradeoff between approximation fidelity and computational cost.

For comprehensive mathematical development, empirical results, and implementation details, see (Tan et al., 2024).
