Empirical Soft Regret (ESR)
- Empirical Soft Regret (ESR) is a surrogate loss function designed for binary decision-making under bandit feedback, directly targeting reduction in decision regret.
- It smooths a non-differentiable indicator using a logistic function, enabling gradient-based optimization with flexible models like neural networks.
- Empirical results on recommendation and causal inference benchmarks demonstrate ESR's ability to achieve lower regret and higher click-through rates compared to traditional methods.
Empirical Soft Regret (ESR) is a surrogate loss function for the predict-then-optimize paradigm in binary decision-making under bandit feedback. ESR directly targets reduction in decision regret, distinguishing itself from classical mean-squared error (MSE) and related approaches by being specifically designed for settings where only observed rewards for chosen actions (not counterfactual outcomes) are available. The ESR loss is constructed to be differentiable, enabling the training of highly flexible parametric models, including neural networks, through gradient-based optimization. ESR is theoretically justified under paired-data idealizations and empirically validated on benchmarks in recommendation and causal inference, where it achieves lower regret than state-of-the-art baselines (Tan et al., 2024).
1. Predict-then-Optimize Setup and Regret Formulation
In the predict-then-optimize framework, a practitioner observes historical data consisting of triplets $(x_i, a_i, y_i)$, where $x_i \in \mathcal{X}$ is a context, $a_i \in \{0,1\}$ is a binary action, and $y_i$ is the realized reward from an unknown reward function $f(x, a)$. For each new context $x$, the goal is to select the action $a \in \{0,1\}$ that maximizes reward.
A parametric model $f_\theta(x, a)$ is trained to predict outcomes. The induced policy is

$$\pi_\theta(x) = \arg\max_{a \in \{0,1\}} f_\theta(x, a),$$

and the regret at context $x$ is

$$R(x) = \max_{a \in \{0,1\}} f(x, a) - f\bigl(x, \pi_\theta(x)\bigr).$$

The objective is to minimize expected regret $\mathbb{E}_x[R(x)]$, rather than pointwise prediction error.
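As a concrete illustration, the induced policy and per-context regret can be sketched as follows (a minimal NumPy sketch; `f_true`, `f_model`, and the noise level are hypothetical stand-ins, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth reward f(x, a), for illustration only:
# action 1 is better whenever the context mean exceeds 0.1.
def f_true(x, a):
    return a * x.mean() + (1 - a) * 0.1

# A stand-in "model": a noisy copy of the truth, to keep the sketch short.
def f_model(x, a):
    return f_true(x, a) + 0.05 * rng.standard_normal()

def induced_policy(f, x):
    # pi(x) = argmax_a f(x, a) over the binary action set {0, 1}
    return int(f(x, 1) > f(x, 0))

def regret(x):
    # R(x) = max_a f*(x, a) - f*(x, pi(x)): reward lost by following the model
    a_hat = induced_policy(f_model, x)
    return max(f_true(x, 0), f_true(x, 1)) - f_true(x, a_hat)

xs = rng.standard_normal((1000, 5))
avg_regret = np.mean([regret(x) for x in xs])
print(f"average regret: {avg_regret:.4f}")  # typically small: errors occur near the boundary
```

Note that regret is nonnegative by construction and vanishes wherever the model's decision matches the truth, even if its predicted values are wrong; this is exactly the gap between decision quality and pointwise accuracy that motivates ESR.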
2. ESR Loss Definition and Surrogate Construction
The canonical “hard” empirical regret for a datapoint $(x_i, a_i, y_i)$ and its counterfactual is

$$\hat{R}_i = |\delta_i| \cdot \mathbb{1}\bigl\{\delta_i \, \Delta_\theta(x_i) < 0\bigr\},$$

where $\delta_i = y_i^{(1)} - y_i^{(0)}$ is the true reward gap and $\Delta_\theta(x) = f_\theta(x, 1) - f_\theta(x, 0)$ is the predicted gap. The indicator function is not differentiable with respect to $\theta$.

ESR replaces the non-differentiable indicator with a smooth logistic surrogate. For smoothing parameter $\lambda > 0$:

$$\tilde{R}_i = \frac{|\delta_i|}{1 + \exp\bigl(\delta_i \, \Delta_\theta(x_i) / \lambda\bigr)}.$$

In practical bandit feedback, both $y_i^{(1)}$ and $y_i^{(0)}$ may not be observed for each $x_i$. ESR approximates counterfactuals via nearest-neighbor pairing across actions:

$$j(i) = \arg\min_{j \,:\, a_j = 1 - a_i} \|x_j - x_i\|, \qquad \hat{\delta}_i = \begin{cases} y_i - y_{j(i)}, & a_i = 1,\\ y_{j(i)} - y_i, & a_i = 0,\end{cases}$$

yielding the empirical soft regret loss:

$$\mathcal{L}_{\mathrm{ESR}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{\delta}_i|}{1 + \exp\bigl(\hat{\delta}_i \, \Delta_\theta(x_i) / \lambda\bigr)}.$$
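The empirical soft regret loss can be sketched directly in NumPy (a minimal sketch; `esr_loss` and the sample gaps are illustrative, and it assumes the paired gaps $\hat{\delta}_i$ have already been constructed):

```python
import numpy as np

def esr_loss(delta_y, delta_theta, lam=0.1):
    """Empirical soft regret over paired (pseudo-)counterfactual gaps.

    delta_y:     y^(1) - y^(0), the reward gap per sample
    delta_theta: f_theta(x,1) - f_theta(x,0), the model's predicted gap
    lam:         smoothing parameter; the surrogate approaches hard regret as lam -> 0
    """
    # |delta_y| / (1 + exp(delta_y * delta_theta / lam)):
    # near |delta_y| when predicted and true gaps disagree in sign, near 0 otherwise
    return np.mean(np.abs(delta_y) / (1.0 + np.exp(delta_y * delta_theta / lam)))

delta_y = np.array([1.0, -0.5, 0.8])
good = esr_loss(delta_y, np.array([2.0, -1.0, 1.5]), lam=0.01)   # signs all agree
bad = esr_loss(delta_y, np.array([-2.0, 1.0, -1.5]), lam=0.01)   # signs all disagree
print(good, bad)
```

With a small $\lambda$, the loss is near zero when every predicted gap has the correct sign, and near the mean absolute gap when every sign is wrong, matching the hard-regret limit.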
3. Derivation, Differentiability, and Gradient Structure
As $\lambda \to 0$, ESR recovers the hard regret: the denominator becomes $1$ where the signs of $\hat{\delta}_i$ and $\Delta_\theta(x_i)$ disagree (incorrect decision) and diverges otherwise (correct decision), sending the per-sample loss to $|\hat{\delta}_i|$ or $0$, respectively. The loss is differentiable with respect to model parameters:

$$\nabla_\theta \mathcal{L}_{\mathrm{ESR}} = -\frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{\delta}_i| \, \hat{\delta}_i}{\lambda} \, s_i (1 - s_i) \, \nabla_\theta \Delta_\theta(x_i),$$

where $s_i = \sigma\bigl(-\hat{\delta}_i \, \Delta_\theta(x_i)/\lambda\bigr)$ and $\sigma(z) = 1/(1 + e^{-z})$ is the logistic sigmoid. This direct differentiability makes ESR amenable to modern autodiff and optimizer frameworks.
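The closed-form gradient can be checked against a finite-difference estimate, a useful sanity test when implementing the surrogate by hand (a minimal sketch; function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_regret(delta_y, delta_theta, lam):
    # per-sample surrogate: |delta_y| * sigmoid(-delta_y * delta_theta / lam)
    return np.abs(delta_y) * sigmoid(-delta_y * delta_theta / lam)

def soft_regret_grad(delta_y, delta_theta, lam):
    # analytic derivative with respect to the predicted gap delta_theta
    s = sigmoid(-delta_y * delta_theta / lam)
    return -(np.abs(delta_y) * delta_y / lam) * s * (1.0 - s)

# Finite-difference check that the closed form matches the true derivative
dy, dt, lam, eps = 0.8, -0.3, 0.5, 1e-6
fd = (soft_regret(dy, dt + eps, lam) - soft_regret(dy, dt - eps, lam)) / (2 * eps)
an = soft_regret_grad(dy, dt, lam)
print(fd, an)
```

The chain rule then propagates this scalar factor through $\nabla_\theta \Delta_\theta(x_i)$, which an autodiff framework handles automatically.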
4. Training Implementation
Training with ESR involves the following steps for each epoch:
- Precompute neighbor indices $j(i)$ using a KD-tree or approximate nearest neighbors in context space,
- For each minibatch, compute for each sample $i$:
- $\Delta_\theta(x_i) = f_\theta(x_i, 1) - f_\theta(x_i, 0)$,
- $\hat{\delta}_i$, the pseudo-gap oriented as $y_i^{(1)} - y_i^{(0)}$ (observed reward minus matched neighbor's reward, with the sign flipped when $a_i = 0$),
- $\ell_i = |\hat{\delta}_i| \, \sigma\bigl(-\hat{\delta}_i \, \Delta_\theta(x_i)/\lambda\bigr)$,
- Batch loss is $\frac{1}{|B|} \sum_{i \in B} \ell_i$,
- Backpropagate gradient using standard optimizers (e.g., Adam).
Hyperparameters include the smoothing parameter $\lambda$ and the learning rate. For numerical stability, the argument of the exponential should be clipped to a bounded range. The nearest-neighbor search requires a one-time preprocessing pass over the dataset.
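The steps above can be combined into a self-contained training sketch on synthetic bandit data (brute-force neighbor matching and a linear gap model stand in for a KD-tree and a neural network; all data, names, and constants here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bandit log: contexts X, logged binary actions A, observed rewards Y.
# Illustrative ground truth: action 1 pays x[0], action 0 pays -x[0].
n, d = 400, 3
X = rng.standard_normal((n, d))
A = rng.integers(0, 2, size=n)
Y = np.where(A == 1, X[:, 0], -X[:, 0]) + 0.1 * rng.standard_normal(n)

# Step 1: one-time nearest-neighbor pairing across actions
# (brute force here; a KD-tree or ANN index replaces this at scale).
def match_counterfactuals(X, A):
    j = np.empty(len(X), dtype=int)
    for i in range(len(X)):
        other = np.flatnonzero(A != A[i])
        j[i] = other[np.argmin(np.linalg.norm(X[other] - X[i], axis=1))]
    return j

j = match_counterfactuals(X, A)
# Pseudo-gap oriented as y^(1) - y^(0) for every sample
delta_y = np.where(A == 1, Y - Y[j], Y[j] - Y)

# Remaining steps: minibatch gradient descent on a linear gap model,
# delta_theta(x) = x @ w, standing in for f_theta(x,1) - f_theta(x,0).
lam, lr = 0.1, 0.1
w = np.zeros(d)
for epoch in range(200):
    for idx in np.array_split(rng.permutation(n), 4):
        dt = X[idx] @ w
        z = np.clip(-delta_y[idx] * dt / lam, -50.0, 50.0)  # clipped for stability
        s = 1.0 / (1.0 + np.exp(-z))                        # sigmoid(-delta_y*dt/lam)
        # Gradient of mean_i |delta_y_i| * s_i with respect to w
        g = -(np.abs(delta_y[idx]) * delta_y[idx] / lam) * s * (1.0 - s)
        w -= lr * (g @ X[idx]) / len(idx)

# The learned policy sign(X @ w) should favor action 1 roughly when x[0] > 0
acc = np.mean((X @ w > 0) == (X[:, 0] > 0))
print(f"decision agreement with ground truth: {acc:.2f}")
```

A plain optimizer loop is used here for transparency; in practice the same loss drops into any autodiff framework with Adam, as the section describes.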
5. Theoretical Properties: Asymptotic Optimality
Under the idealization where both actions are observed for each context (paired data), and assuming:
- Bounded action gap: $|f(x,1) - f(x,0)| \le \Delta_{\max}$ for all $x$,
- The model class has a controlled covering number,
- The predicted difference $\Delta_\theta(x)$ does not concentrate near zero too rapidly,
one establishes, for $\lambda = \lambda_n \to 0$ at a suitable rate and regularity on the function class, that the expected regret of the ESR minimizer $\hat{\theta}$ converges to that of the regret-optimal model in the class, with probability approaching one at an exponential rate. The proof builds on uniform convergence of the soft surrogate to hard regret and balances statistical complexity with the smoothing bias via the choice of $\lambda$.
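The shape of the argument follows a standard surrogate-risk decomposition (a sketch under the paired-data idealization, not the paper's exact bound), writing $\mathcal{L}_\lambda$ for the soft surrogate risk, $R$ for expected hard regret, and $\theta^\ast$ for the regret-optimal model:

```latex
R(\hat{\theta}) - R(\theta^\ast)
  = \bigl[R(\hat{\theta}) - \mathcal{L}_\lambda(\hat{\theta})\bigr]
  + \underbrace{\bigl[\mathcal{L}_\lambda(\hat{\theta}) - \mathcal{L}_\lambda(\theta^\ast)\bigr]}_{\le\,0
      \text{ by optimality of } \hat{\theta}}
  + \bigl[\mathcal{L}_\lambda(\theta^\ast) - R(\theta^\ast)\bigr]
  \;\le\; 2 \sup_{\theta}\bigl|\mathcal{L}_\lambda(\theta) - R(\theta)\bigr|,
```

so it suffices to control the uniform deviation between soft and hard regret, which shrinks as $\lambda \to 0$ at a rate traded off against the complexity of the model class.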
6. Empirical Results and Benchmarks
ESR has been evaluated on two distinct benchmark domains:
IHDP semi-synthetic CATE Benchmark (25 covariates):
- Baselines: S-learner, T-learner, R-learner, DR-learner using two-layer neural nets.
- Test-set regret (95% CIs over 1000 runs):
| Method | Test Regret (95% CI) |
|---|---|
| S-learner | [1.02, 1.36] |
| T-learner | [0.70, 0.97] |
| R-learner | [2.81, 3.19] |
| DR-learner | [0.77, 1.21] |
| ESR | [0.35, 0.43] |

Yahoo! R6A News Recommendation (20 articles):
- Binary reduction by sampling article pairs.
- Off-policy evaluation via naive inverse propensity scoring (IPS); baseline methods as above.
- Estimated click-through rate (95% CIs over 10 days):
| Method | CTR (95% CI) |
|---|---|
| ESR | [4.11%, 4.45%] |
| Direct (MSE) | [3.72%, 4.04%] |
| T-learner | [3.57%, 3.84%] |
| R-learner | [3.56%, 3.84%] |
| DR-learner | [3.50%, 3.79%] |
In both settings, ESR outperforms established baselines with statistical significance.
7. Implementation Notes, Scope, and Limitations
Recommended practice is to choose the smoothing parameter $\lambda$ small enough to approximate hard regret while keeping gradients informative; standard optimizers and learning rates are effective. Nearest-neighbor retrieval in context space is critical to the construction of pseudo-counterfactual pairs and directly affects regret-estimation accuracy; high data sparsity can limit performance. The main computational overhead is the one-time neighbor search; subsequent training is standard.
ESR is most beneficial when counterfactual outcomes are absent and classical MSE is insufficient due to context-dependent reward shifts. It enables direct minimization of regret rather than pointwise error, providing a performance advantage in decision-focused scenarios such as bandit feedback, policy learning, and individualized treatment effect estimation.
Limitations include restriction to binary actions; generalization to multiple or continuous actions is an open area. The method presumes model capacity sufficient to learn the true reward gap structure. Nearest-neighbor quality depends on geometry and data density, introducing a tradeoff between approximation fidelity and computational cost.
For comprehensive mathematical development, empirical results, and implementation details, see (Tan et al., 2024).