End-to-End Predict-then-Optimize (EPO)
- End-to-End Predict-then-Optimize (EPO) is a unified machine learning–optimization paradigm that integrates prediction and decision-making via differentiable loss functions.
- It employs techniques like implicit differentiation, surrogate losses, and unrolled algorithms to propagate gradients through optimization layers effectively.
- EPO frameworks have demonstrated practical gains in applications such as portfolio optimization, resource allocation, and real-time decision-making, significantly reducing decision regret.
End-to-End Predict-then-Optimize (EPO) describes a unified machine learning–optimization paradigm in which predictive models are trained directly for downstream decision quality rather than purely for predictive accuracy. Unlike the traditional two-stage “predict then optimize” pipeline—which estimates unknown problem parameters and solves a decision problem separately—EPO frameworks integrate the optimization task as a differentiable “layer” or learning target within model training. The objective is to address the well-documented phenomenon that minimizing prediction error does not necessarily yield optimal decisions when the decision rule is highly sensitive or nonlinearly related to prediction error. Recent research proposes various architectures, loss functions, and training principles enabling EPO across linear, combinatorial, multi-stage, and stochastic optimization settings.
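A toy numerical illustration of this phenomenon (with made-up cost vectors): when the decision is an argmin over predicted costs, a predictor with far lower MSE can still induce the wrong decision, while a predictor with large but ranking-preserving errors incurs zero regret.

```python
import numpy as np

# Two candidate actions with true costs; the decision rule picks the argmin.
c_true = np.array([1.0, 1.1])          # action 0 is optimal

c_hat_A = np.array([0.5, 0.6])         # large errors, but preserves the ranking
c_hat_B = np.array([1.09, 1.05])       # tiny errors, but flips the ranking

def mse(c_hat):
    """Standard prediction loss against the true costs."""
    return np.mean((c_hat - c_true) ** 2)

def regret(c_hat):
    """Excess true cost of the decision induced by the prediction."""
    return c_true[np.argmin(c_hat)] - c_true.min()

# Predictor B is far more accurate by MSE yet incurs positive regret,
# while the inaccurate predictor A still makes the optimal decision.
```

This is the sensitivity the EPO literature targets: prediction error and decision error are only loosely coupled near decision boundaries.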
1. Core EPO Principles and Loss Structures
At the heart of EPO is the direct coupling of prediction and decision through an end-to-end differentiable loss that reflects both predictive fit and decision impact. In SimPO (Zhang et al., 2022), the composite loss is formalized as

$$\mathcal{L}(\theta) \;=\; w_{p}\,\ell_{\mathrm{pred}}\!\left(\hat{y}_{\theta},\, y\right) \;+\; w_{d}\,\mathbb{E}\!\left[c\!\left(z^{*}(\hat{y}_{\theta});\, y\right)\right],$$

where $\ell_{\mathrm{pred}}$ is a standard prediction loss (e.g., MSE), $\mathbb{E}\!\left[c\!\left(z^{*}(\hat{y}_{\theta}); y\right)\right]$ is the downstream optimization cost evaluated in expectation at the decision $z^{*}(\hat{y}_{\theta})$ induced by the prediction, and $w_{p}, w_{d}$ are weighting functions modulating trade-offs between prediction fidelity and decision sensitivity. The special case $w_{p} = 0$ (pure decision loss) appears widely as the canonical EPO objective.
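A minimal numerical sketch of such a weighted prediction-plus-decision objective, assuming a hypothetical newsvendor-style toy problem with a naive plug-in decision rule (all names and cost constants here are ours, not SimPO's):

```python
import numpy as np

def decision_cost(order, demand, c_hold=1.0, c_short=5.0):
    """Newsvendor-style downstream cost of ordering `order` units when
    realized demand is `demand` (hypothetical toy problem): holding cost
    for overage, a steeper shortage cost for underage."""
    return c_hold * np.maximum(order - demand, 0) + c_short * np.maximum(demand - order, 0)

def composite_loss(y_hat, y, w_p=1.0, w_d=1.0):
    """Weighted EPO-style objective: prediction fit plus expected decision
    cost, with the plug-in decision rule z*(y_hat) = y_hat."""
    pred_loss = np.mean((y_hat - y) ** 2)           # standard MSE term
    dec_loss = np.mean(decision_cost(y_hat, y))     # decision-quality term
    return w_p * pred_loss + w_d * dec_loss

y = np.array([10.0, 12.0, 8.0])
under = np.array([9.0, 11.0, 7.0])   # under-predicts by 1 everywhere
over = np.array([11.0, 13.0, 9.0])   # over-predicts by 1 everywhere

# Both predictors have identical MSE, but under-prediction triggers the
# costlier shortage penalty, so the composite loss tells them apart.
```

A pure prediction loss cannot distinguish the two predictors; the decision term breaks the tie in favor of the one with lower downstream cost.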
In combinatorial and multi-stage settings, similar weighted schemes apply, where the loss may reflect regret, task-specific divergence, or multi-objective fairness criteria (e.g., Ordered Weighted Averaging in resource allocation (Dinh et al., 2024)).
2. Differentiable Optimization Layers and Gradient Computation
EPO critically depends on the ability to propagate gradients through the optimization problem with respect to model parameters. The primary approaches include:
- Implicit Differentiation via KKT Conditions: When the downstream task is a convex program (QP or LP), it admits closed-form optimality conditions (Karush-Kuhn-Tucker). Writing the KKT system as $G(z^{*}, \lambda^{*}, \nu^{*};\, \theta) = 0$ and differentiating with respect to model parameters yields the linear system

$$\frac{\partial G}{\partial (z, \lambda, \nu)}\, \frac{\partial (z^{*}, \lambda^{*}, \nu^{*})}{\partial \theta} \;=\; -\,\frac{\partial G}{\partial \theta},$$

which is solved for the decision sensitivities $\partial z^{*} / \partial \theta$ and chained with the decision loss.
Algorithmically, this is realized using automatic differentiation through QP solvers or custom backward passes (Zhang et al., 2022).
- Differentiable Surrogates: For integer/discrete or nonsmooth settings, soft relaxation (e.g., log-barrier, quadratic penalty, Moreau envelope) enables approximate but smooth gradients (Mandi et al., 2020, Dinh et al., 2024).
- Unrolled Algorithms: Alternating Direction Method of Multipliers (ADMM) or projected-gradient steps can be treated as unrolled network layers, supporting backpropagation via the chain rule across iterations (Wang et al., 2024).
- Implicit Function Theorem with Monte Carlo Sampling: In sequential or RL settings, unbiased derivative estimates of the optimization fixed-points are computed using sampled trajectories, with Hessians approximated via low-rank Woodbury identities (Wang et al., 2021).
- Proxy and Cone-Alignment Methods: For binary or combinatorial problems, the predicted cost vector is directly aligned to the normal cone of the optimal solution, avoiding explicit integer program solves (Tang et al., 2023).
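As a minimal illustration of the implicit-differentiation idea, consider a hypothetical unconstrained quadratic downstream problem, where the KKT system reduces to the stationarity condition and the implicit gradient has a closed form we can check against finite differences:

```python
import numpy as np

# Toy downstream problem: z*(y) = argmin_z 0.5 z'Qz - y'z  (Q positive definite).
# Stationarity G(z, y) = Qz - y = 0 is the entire KKT system here.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def solve(y):
    """Solver for the inner problem: z* = Q^{-1} y."""
    return np.linalg.solve(Q, y)

def implicit_jacobian():
    """Implicit differentiation of the stationarity residual:
       dG/dz = Q, dG/dy = -I  =>  dz*/dy = -Q^{-1}(-I) = Q^{-1}."""
    return np.linalg.inv(Q)

def fd_jacobian(y, eps=1e-6):
    """Central finite-difference check of the implicit gradient."""
    J = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2); e[j] = eps
        J[:, j] = (solve(y + e) - solve(y - e)) / (2 * eps)
    return J

y0 = np.array([1.0, -2.0])
```

With constraints present, the same recipe applies but the linear system grows to include dual variables; frameworks differentiate through the solver rather than forming the inverse explicitly.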
3. Representative Algorithms and Their Instantiations
EPO frameworks admit diverse algorithmic realizations depending on the downstream problem:
- SimPO (Simultaneous Prediction and Optimization) (Zhang et al., 2022): Couples MSE and decision loss via weighted objectives and backpropagates through a constrained inner optimization (typically convex).
- End-to-End Clustering for Assignment (Zhang et al., 2022): Integrates a GCN-based predictor for task demand with a differentiable, constraint-regularized K-means clustering layer, jointly maximizing modularity in courier assignment.
- Decision Trees under SPO Loss (Elmachtoub et al., 2020): Trains interpretable segmentation trees with splits and leaf assignments chosen to minimize empirical excess downstream cost, rather than MSE.
- Energy-Based Model Surrogates (Kong et al., 2022): Used when KKT differentiation is impractical; these parameterize the expected loss as an "energy," with gradients computed via maximum likelihood and KL regularization and approximated by importance sampling.
- PyEPO Toolkit (Tang et al., 2022): Provides several EPO algorithms, including convex SPO+ surrogates, differentiable black-box finite-difference, perturbation-based Monte Carlo (DPO), and Fenchel-Young losses, leveraging differentiable solvers or gradient approximations.
- Proxy and LtOF Approaches (Kotary et al., 2023, Kotary et al., 2024): Bypass explicit argmin differentiation by learning the mapping from features directly to optimal decisions with learning-to-optimize losses; suitable for convex, nonconvex, or combinatorial downstream problems.
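To make the SPO+ surrogate concrete, here is a hedged sketch for a deliberately tiny combinatorial problem (choose exactly one of k items, so the feasible set's vertices are one-hot vectors); the function names are ours, not PyEPO's API, but the loss follows the Elmachtoub–Grigas formulation:

```python
import numpy as np

def opt(c):
    """Oracle: optimal vertex and value for min_w c'w over one-hot w."""
    i = int(np.argmin(c))
    w = np.zeros_like(c); w[i] = 1.0
    return w, c[i]

def spo_plus(c_hat, c):
    """SPO+ surrogate loss:
       max_w (c - 2 c_hat)'w + 2 c_hat' w*(c) - z*(c),
    where w*(c), z*(c) are the true-cost optimum and its value."""
    w_star, z_star = opt(c)
    # max_w (c - 2 c_hat)'w  ==  -min_w (2 c_hat - c)'w
    _, inner = opt(2 * c_hat - c)
    return -inner + 2 * c_hat @ w_star - z_star

def spo_plus_grad(c_hat, c):
    """A subgradient in c_hat: 2 (w*(c) - w*(2 c_hat - c))."""
    w_star, _ = opt(c)
    w_pert, _ = opt(2 * c_hat - c)
    return 2 * (w_star - w_pert)

c = np.array([1.0, 2.0, 3.0])      # true costs: item 0 is optimal
c_hat = np.array([2.5, 2.0, 3.0])  # prediction that flips the ranking
```

The loss is zero at a perfect prediction and strictly positive here, and the subgradient pushes the predicted cost of the wrongly favored item up and the truly optimal item down — exactly the decision-aware signal a two-stage MSE loss lacks.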
4. Theoretical Properties and Limitations
The theoretical underpinnings of EPO frameworks are problem-specific:
- Convergence: For convex losses and differentiable solvers, stationary-point convergence is standard, though formal proofs are often omitted in initial work (e.g., SimPO (Zhang et al., 2022)).
- Optimality Gap Bounds: Explicit error bounds for end-to-end vs. two-stage approaches depend on the “price of correlation” (POC) in stochastic optimization. For certain nonlinearly-combined or correlated multi-target objectives, two-stage approaches can be unboundedly suboptimal relative to fully EPO training (Cameron et al., 2021).
- Regret Consistency: Weighted MSE losses with globally parameterized feature-based weights are Fisher consistent for linear objectives (Shah et al., 2023).
- Bias Control: Decision-focused fine-tuning (DFF) frameworks enforce strict deviation bounds between trained predictors and original backbones (with RMSE and angular bias guarantees under trust-region constraints) (Yang et al., 3 Jan 2025).
Limitations include the computational cost of backward passes (especially with integer or nonconvex programs), absence of general regret bounds in nonconvex/discrete settings, and the dependence of solution quality on the approximability or capacity of learned optimization proxies.
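A minimal sketch of the kind of deviation control mentioned above, simplified to its geometric core (our own stand-in, not the DFF algorithm): keep fine-tuned predictions inside an RMSE trust region around the backbone's outputs by projecting back onto the boundary when the bound is violated.

```python
import numpy as np

def project_to_rmse_ball(y_ft, y_backbone, eps):
    """If fine-tuned predictions y_ft deviate from the backbone's
    predictions by more than `eps` RMSE, rescale the deviation onto the
    trust-region boundary; otherwise return them unchanged. A simplified
    stand-in for the bias bounds enforced in decision-focused fine-tuning."""
    d = y_ft - y_backbone
    rmse = np.sqrt(np.mean(d ** 2))
    if rmse <= eps:
        return y_ft
    return y_backbone + d * (eps / rmse)

y_backbone = np.zeros(4)
y_ft = np.ones(4)                      # RMSE deviation of 1.0
y_proj = project_to_rmse_ball(y_ft, y_backbone, eps=0.5)
```

The projection caps how far decision-focused training can pull the predictor from its accuracy-trained starting point, trading some decision gain for a guaranteed bound on predictive bias.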
5. Empirical Evaluation and Domains of Application
Experiments conducted across this literature span synthetic and real-world domains including:
- Inventory and demand forecasting, network flow, portfolio optimization, shortest-path, news recommendation, seaport scheduling, vehicle repositioning, and express courier assignment (Zhang et al., 2022, Wang et al., 2024, Pu et al., 11 Nov 2025, Yang et al., 3 Jan 2025).
Notable findings include:
- EPO-trained models can achieve 1.2–10.2% modularity gains in assignment (Zhang et al., 2022), halve decision regret relative to two-stage pipelines in robust portfolio and fair allocation tasks (Dinh et al., 2024), and substantially outperform two-stage approaches when objective cost combines multiple, correlated predictions (Cameron et al., 2021).
- In combinatorial settings (e.g., TSP, CVRP), cone-alignment yields an order-of-magnitude speedup with negligible quality loss (Tang et al., 2023).
- Real-time inference is possible in learn-to-optimize-from-features (LtOF) frameworks, which achieve performance comparable to EPO on convex problems and superior performance on nonconvex or combinatorial ones (Kotary et al., 2024).
- Decision-focused continual learning with Fisher regularization delivers both improved generalization and controlled catastrophic forgetting in evolving task streams (Pu et al., 11 Nov 2025).
6. Advancements, Toolkits, and Research Directions
EPO methodology has seen significant development in both practical toolkits and theoretical frameworks.
- Toolkits: PyEPO (Tang et al., 2022) is a modular PyTorch library supporting LP/IP objectives and a range of EPO-compatible surrogate losses and differentiation methods.
- EPO with black-box or non-differentiable predictors: DFF (Yang et al., 3 Jan 2025) offers bias-corrected end-to-end fine-tuning for non-differentiable/black-box settings, with explicit bounds on divergence from the original model.
- Continual and Fair EPO: Decision-focused continual learning utilizes Fisher information-based regularization for streaming tasks (Pu et al., 11 Nov 2025). Multiobjective fairness criteria (e.g., OWA objectives) are supported by smoothed QP and Moreau-envelope surrogates (Dinh et al., 2024).
- Proxy-based EPO: LtOF and related approaches unify EPO with the learn-to-optimize paradigm, offering scalability to nonconvex programs and avoiding solver differentiation (Kotary et al., 2023, Kotary et al., 2024).
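The perturbation-based losses supported by such toolkits rest on a simple smoothing idea: the argmin of a linear objective is piecewise constant in the predicted costs, but its expectation under random perturbation is smooth. A minimal Monte Carlo sketch on a toy one-hot problem (our own code, not PyEPO's API):

```python
import numpy as np

def opt_one_hot(c):
    """Oracle for min_w c'w over one-hot decisions."""
    w = np.zeros_like(c)
    w[np.argmin(c)] = 1.0
    return w

def perturbed_solution(c_hat, sigma=0.5, n=2000, rng=None):
    """Monte Carlo estimate of E[w*(c_hat + sigma * Z)], Z ~ N(0, I).
    This expectation varies smoothly in c_hat even though w* itself is
    piecewise constant, which is what makes perturbation-based EPO
    losses differentiable."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, c_hat.size))
    return np.mean([opt_one_hot(c_hat + sigma * z) for z in Z], axis=0)

c_hat = np.array([0.0, 3.0])                    # item 0 clearly cheaper
w_soft = perturbed_solution(c_hat, rng=0)       # ~[1, 0], but smooth in c_hat
```

Larger `sigma` spreads probability mass over more vertices (smoother gradients, more bias); smaller `sigma` approaches the hard argmin.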
Current research directions include extending EPO to fully combinatorial/graphical problems using RL-based proxies, deriving more general distribution-dependent regret bounds, improving gradient estimation for highly nonconvex or nonsmooth objectives, and developing problem-agnostic EPO layers compatible with large-scale black-box solvers.