Robust Deep ES Estimator

Updated 18 November 2025
  • The paper introduces a two-stage deep learning framework that orthogonalizes quantile and expected shortfall estimation, combining deep quantile regression with a Huber-loss-based second stage.
  • It achieves non-asymptotic tail robustness with provable error bounds, effectively mitigating the influence of heavy-tailed residuals.
  • Empirical studies demonstrate improved prediction accuracy in high-dimensional settings, especially under heavy-tailed noise in environmental applications.

A Robust Deep ES (Expected Shortfall) Estimator, in the context of modern machine learning, is a deep neural methodology for estimating the conditional tail risk of a target variable, designed with explicit robustness to heavy-tailed response distributions and model misspecification. The estimator operates in high-dimensional, nonparametric settings via hierarchical architectures, orthogonalizes the estimation of the quantile and expected shortfall functions, and incorporates robustification techniques such as the Huber loss to achieve non-asymptotic resistance to outliers and model noise (Yu et al., 11 Nov 2025).

1. Mathematical Formulation of Expected Shortfall Regression

Let $Y$ be a real-valued response variable with cumulative distribution function $F_Y$. The Value-at-Risk (VaR) at level $\alpha \in (0,1)$ is $q_\alpha(Y) := \inf\{y : F_Y(y) \geq \alpha\}$, and the Expected Shortfall (ES, also known as Conditional Value-at-Risk) at level $\alpha$ is

$$e_\alpha(Y) := \mathbb{E}\left[\, Y \mid Y \leq q_\alpha(Y) \,\right] = \frac{1}{\alpha}\,\mathbb{E}\left[\, Y\,\mathbb{1}\{Y \leq q_\alpha(Y)\} \,\right].$$

For regression with covariates $X \in \mathbb{R}^d$, the nonparametric functions $f_0(x) := q_\alpha(Y \mid X = x)$ and $g_0(x) := e_\alpha(Y \mid X = x)$ denote the conditional quantile and conditional ES, respectively. Since ES is not directly elicitable, the robust deep ES estimator employs a "two-step orthogonalization framework": first estimate $f_0$ (the conditional quantile) using deep quantile regression (DQR), then estimate $g_0$ from the residuals, treating $f_0$ as a nuisance parameter (Yu et al., 11 Nov 2025).
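
As a concrete illustration of these definitions (not code from the paper; all names are illustrative), a minimal NumPy sketch computes the empirical VaR and lower-tail ES of a sample:

```python
import numpy as np

def empirical_var_es(y, alpha=0.05):
    """Empirical VaR (the alpha-quantile) and ES (mean of the lower alpha-tail)."""
    q = np.quantile(y, alpha)          # q_alpha(Y)
    es = y[y <= q].mean()              # e_alpha(Y) = E[Y | Y <= q_alpha(Y)]
    return q, es

rng = np.random.default_rng(0)
y = rng.standard_t(df=3, size=100_000)  # heavy-tailed sample
q, es = empirical_var_es(y, alpha=0.05)
print(f"VaR_0.05 = {q:.3f}, ES_0.05 = {es:.3f}")  # ES lies below VaR in the lower tail
```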

2. Algorithmic Structure: Two-Step Deep Robust ES Estimation

The robust deep ES estimator is built as follows:

Stage 1—Deep Quantile Regression (DQR):

  • Fit a class $\mathcal{F}_n$ of truncated, fully connected ReLU networks to minimize the empirical check loss

$$\widehat{Q}_\alpha(f) = \frac{1}{n}\sum_{i=1}^n \rho_\alpha\left(Y_i - f(X_i)\right),$$

where $\rho_\alpha(u) = (\alpha - \mathbb{1}\{u < 0\})\,u$.
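
A minimal PyTorch sketch of this check (pinball) loss, assuming a batch of responses `Y` and fitted values from some network in $\mathcal{F}_n$ (illustrative code, not the authors' implementation):

```python
import torch

def check_loss(u, alpha):
    """Check (pinball) loss rho_alpha(u) = (alpha - 1{u < 0}) * u, averaged over the batch."""
    return torch.mean((alpha - (u < 0).float()) * u)

# Stage 1 objective for a hypothetical network f_net: R^d -> R on a batch (X, Y):
# loss = check_loss(Y - f_net(X).squeeze(-1), alpha=0.05)
```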

Stage 2—Deep Robust ES (DRES):

  • For each candidate $f$, compute surrogate responses $Z_i(f) = \min\{Y_i - f(X_i), 0\} + \alpha f(X_i)$.
  • Fit a class $\mathcal{G}_n$ of truncated, fully connected ReLU networks for $g$ by minimizing the average Huber loss:

$$\hat{g}_{n,\tau} \in \arg\min_{g \in \mathcal{G}_n} \frac{1}{n}\sum_{i=1}^n \ell_\tau\left(Z_i(\hat{f}_n) - \alpha g(X_i)\right),$$

where $\ell_\tau(u)$ is the Huber loss with robustification parameter $\tau$.
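
The surrogate responses and Stage-2 objective admit a short PyTorch sketch (again illustrative; `torch.nn.functional.huber_loss` with `delta=tau` matches the standard Huber parameterization):

```python
import torch
import torch.nn.functional as F

def surrogate_response(y, f_hat, alpha):
    """Z_i(f) = min{Y_i - f(X_i), 0} + alpha * f(X_i)."""
    return torch.clamp(y - f_hat, max=0.0) + alpha * f_hat

def dres_loss(y, f_hat, g_out, alpha, tau):
    """Average Huber loss of Z_i(f_hat) - alpha * g(X_i)."""
    z = surrogate_response(y, f_hat, alpha)
    return F.huber_loss(z, alpha * g_out, delta=tau)  # quadratic within tau, linear beyond
```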

The Huber loss introduces robustness against heavy-tailed surrogate residuals $Z_i(f)$, which is crucial because the classical squared-error loss handles outliers poorly in the tails, precisely the region that expected shortfall targets (Yu et al., 11 Nov 2025).

3. Statistical Theory and Robustness Guarantees

The robust deep ES estimator achieves provable non-asymptotic tail robustness. Let $\epsilon = Y - f_0(X)$; the key technical condition is a finite $p$-th moment of $\epsilon_- = \min(\epsilon, 0)$, i.e., $\mathbb{E}\left[\,|\epsilon_- - \mathbb{E}(\epsilon_- \mid X)|^p\,\right] < \infty$ for some $p > 1$. The DRES estimator then satisfies, with high probability,

$$\|\hat{g}_{n,\tau} - g_0\|_2 \leq \frac{C}{\alpha}\left[\eta_s + \eta_b + \eta_a + \delta_s + \delta_4^2 + \frac{\nu_p^{1/p} + \sqrt{\tau}}{\sqrt{n}}\right],$$

where

  • $\eta_s$ is the stochastic error,
  • $\eta_b = O(\nu_p \tau^{1-p})$ is the bias from Huber truncation,
  • $\eta_a$ is the approximation error of the ReLU network class,
  • $\delta_s$ reflects estimation error, scaling as $O\left(LN\sqrt{\frac{d\log(dLN)\log n}{n}}\right)$ for network depth $L$ and width $N$,
  • $\delta_4 = \|\hat{f}_n - f_0\|_4 = O_p\left(n^{-\gamma^*/(2\gamma^*+1)}\right)$, with $\gamma^*$ determined by the hierarchical compositional structure assumed for $g_0$.

For sub-Gaussian errors (light-tailed $\epsilon_-$), DRES matches the efficiency of the least-squares-based deep ES estimator (DES); under heavy tails, DRES outperforms DES thanks to its reduced sensitivity to outliers (Yu et al., 11 Nov 2025).
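
To see concretely why the Huber loss helps under heavy tails, the following self-contained NumPy experiment (illustrative, not from the paper) compares the sample mean with a Huber M-estimate of location under $t_{2.25}$ noise; the robustified estimate typically attains a much smaller mean squared error:

```python
import numpy as np

def huber_mean(x, tau, n_iter=50):
    """Huber M-estimate of location via iteratively reweighted least squares."""
    mu = np.median(x)
    for _ in range(n_iter):
        r = np.abs(x - mu)
        w = tau / np.maximum(r, tau)   # weight 1 inside [-tau, tau], tau/|r| outside
        mu = np.sum(w * x) / np.sum(w)
    return mu

rng = np.random.default_rng(1)
errs_mean, errs_huber = [], []
for _ in range(500):
    x = rng.standard_t(df=2.25, size=200)          # true location is 0, heavy tails
    errs_mean.append(x.mean() ** 2)
    errs_huber.append(huber_mean(x, tau=2.0) ** 2)
print(f"MSE of sample mean:     {np.mean(errs_mean):.4f}")
print(f"MSE of Huber (tau=2.0): {np.mean(errs_huber):.4f}")  # typically far smaller
```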

4. Neural Network Architecture and Curse-of-Dimensionality Mitigation

The estimator leverages hierarchical composition models $\mathcal{H}(d, \ell, M_0, \mathcal{P})$, in which $f_0$ and $g_0$ are compositions of low-rank Hölder-smooth functions, enabling deep ReLU networks of moderate size to overcome the curse of dimensionality. Networks are organized with sufficient depth $L$ and width $N$ such that the $L_2$-approximation error admits the rate

$$O\left((L_0 N_0)^{-2\gamma^*}\right),$$

with $\gamma^* = \min_{(\beta, t) \in \mathcal{P}} \beta / t$ determined by the smoothness and interaction order of the composition layers (Yu et al., 11 Nov 2025).
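
A truncated, fully connected ReLU network of prescribed depth and width can be sketched in PyTorch as follows (the truncation bound and layer sizes are illustrative assumptions, not the paper's settings):

```python
import torch.nn as nn

def relu_network(d, depth, width, bound):
    """Fully connected ReLU network R^d -> R with output truncated to [-bound, bound]."""
    layers, in_dim = [], d
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers += [nn.Linear(in_dim, 1), nn.Hardtanh(-bound, bound)]  # output truncation
    return nn.Sequential(*layers)

f_net = relu_network(d=8, depth=4, width=64, bound=10.0)  # illustrative sizes
```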

5. Empirical Performance and Case Studies

Simulation studies in $d = 8$ dimensions (sample size $n = 4096$) show that DRES achieves near-oracle mean squared prediction error for both light-tailed (Gaussian) and heavy-tailed ($t_{2.25}/3$) noise, outperforming local linear ES (LLES) and non-robust DES in the latter regime. Under heavy tails, DRES exhibits dramatically improved accuracy and monotonicity enforcement when combined with non-crossing regularization.

In an environmental science application, DRES estimated the upper-tail ES ($\alpha = 0.99$) for monthly precipitation conditional on El Niño indices and spatio-temporal covariates. Robust ES inference revealed spatial teleconnections better than mean-based analysis, e.g., mapping increased risk of extreme rainfall in southern California and along the Gulf Coast. Variable-importance metrics confirmed key covariates (longitude, latitude, Niño index) for tail-event prediction (Yu et al., 11 Nov 2025).

6. Algorithmic Implementation and Practical Considerations

  • Input data: $\{(X_i, Y_i)\}_{i=1}^n$, quantile level $\alpha$, network hyperparameters, Huber parameter $\tau$.
  • Train the DQR network to estimate $f_0$.
  • Compute $Z_i(\hat{f}_n)$ and fit the DRES network for $g$ using the Huber loss.
  • For multiple $\alpha$ values, enforce monotonicity of the joint quantile/ES outputs if needed.
  • Choosing $\tau$ requires balancing robustification bias (when $\tau$ is too small, since $\eta_b = O(\nu_p \tau^{1-p})$) against sensitivity to outliers (when $\tau$ is too large), with theoretical guidance for its scaling with the sample size; see the end-to-end sketch after this list.
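
Putting the steps together, a minimal two-stage training sketch (reusing `check_loss`, `dres_loss`, and `relu_network` from the snippets above; all hyperparameters, including full-batch Adam, are illustrative choices rather than the authors' protocol):

```python
import torch

def fit_dres(X, Y, alpha=0.05, tau=2.0, epochs=200, lr=1e-3):
    d = X.shape[1]
    # Stage 1: deep quantile regression for the nuisance function f_0.
    f_net = relu_network(d, depth=4, width=64, bound=10.0)
    opt_f = torch.optim.Adam(f_net.parameters(), lr=lr)
    for _ in range(epochs):
        opt_f.zero_grad()
        loss = check_loss(Y - f_net(X).squeeze(-1), alpha)
        loss.backward()
        opt_f.step()
    # Stage 2: robust ES regression on the surrogate responses, f_hat held fixed.
    with torch.no_grad():
        f_hat = f_net(X).squeeze(-1)
    g_net = relu_network(d, depth=4, width=64, bound=10.0)
    opt_g = torch.optim.Adam(g_net.parameters(), lr=lr)
    for _ in range(epochs):
        opt_g.zero_grad()
        loss = dres_loss(Y, f_hat, g_net(X).squeeze(-1), alpha, tau)
        loss.backward()
        opt_g.step()
    return f_net, g_net
```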

A plausible implication is that the two-stage network plus Huber robustification pipeline constitutes a best-practice route for ES estimation when signal structure is compositional and errors are non-sub-Gaussian.

7. Relationship to Other Robust Deep Estimation Frameworks

Robust deep ES estimation is distinct from both deep energy-score estimators (Saremi et al., 2018) and robust deep maximum-likelihood estimators such as DeepMLE (Xiao et al., 2022):

  • The former addresses unsupervised density/score-function estimation, not supervised tail risk.
  • DeepMLE (Xiao et al., 2022) employs mixture models and explicit uncertainty prediction for geometric vision tasks, emphasizing Gaussian-uniform mixture robustness at the pixel level, whereas robust deep ES regression targets conditional tail functionals of the response given the covariates.

The robust deep ES estimator also contrasts with black-box evolution strategies, which optimize noise-averaged objectives for parameter robustness (Lehman et al., 2017; Meier et al., 2019). Instead of searching parameter space for perturbation-invariant optima, DRES directly targets conditional tail means, achieving robustness to heavy-tailed responses by construction and with formal statistical guarantees (Yu et al., 11 Nov 2025).

