Predict-Then-Optimize Paradigm

Updated 6 February 2026

Predict-then-optimize is a two-stage framework that first uses machine learning to predict key parameters and then deploys mathematical optimization to minimize decision regret.
It emphasizes decision-focused learning by aligning predictive models with downstream optimization through surrogate losses such as SPO and SPO⁺.
The approach is applied in scheduling, portfolio allocation, logistics, and resource management, while ongoing research addresses scalability, robustness, and sequential extensions.

The predict-then-optimize (PTO) paradigm refers to a two-stage framework for decision-making under uncertainty, in which machine learning models first generate predictions of problem parameters from observed features, and these predictions are then input to an optimization algorithm to select actions or decisions. This paradigm is widely employed in operations research, machine learning, and applied optimization, including applications such as scheduling, portfolio allocation, logistics, pricing, and resource allocation under uncertainty. PTO combines data-driven predictive modeling with the mathematical structure of downstream optimization and has motivated a comprehensive body of research on decision-focused learning, surrogate loss design, generalization guarantees, extensions to sequential and stochastic settings, and robustification strategies.

1. Formalization of the Predict-Then-Optimize Workflow

The canonical PTO workflow consists of the following sequential steps (Elmachtoub et al., 2017, Wang et al., 2021):

Prediction step: Given features $x \in \mathcal{X}$, a model $f_\theta: \mathcal{X} \to \Theta$ predicts unknown parameters $\hat\theta = f_\theta(x)$ (e.g., costs, rewards, or transition probabilities).
Optimization step: The predicted parameter $\hat\theta$ is used as an input to an optimization problem of the form

$z^*(\hat\theta) = \arg\min_{z \in \mathcal{Z}} f(z; \hat\theta)$

where $f(\cdot\,; \hat\theta)$ is the objective function (not to be confused with the predictor $f_\theta$) and $\mathcal{Z}$ is the feasible set.

Deployment: The solution $z^*(\hat\theta)$ is implemented, resulting in an observed loss or performance measured against the true (but typically unobserved) parameters $\theta$.

The core goal in PTO is not the accuracy of $\hat\theta$ as an estimator, but the quality of the downstream solution $z^*(\hat\theta)$, quantified by the regret

$\Delta(z^*(\hat\theta), \theta) = f(z^*(\hat\theta); \theta) - f(z^*(\theta); \theta)$

and its expectation over the data-generating distribution. This metric motivates training $f_\theta$ for final decision quality rather than for generic prediction error.
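
To make the gap between prediction error and regret concrete, the following is a minimal sketch. It assumes a linear objective $f(z; \theta) = \theta^\top z$ and a hypothetical feasible set of three one-hot decisions; the helper names `solve` and `regret` and all numbers are illustrative and are not taken from the cited works.

```python
import numpy as np

# Hypothetical feasible set Z: choose exactly one of three options (one-hot rows).
Z = np.eye(3)

def solve(theta):
    """Optimization step: z*(theta) = argmin over rows z of Z of theta @ z."""
    return Z[np.argmin(Z @ theta)]

def regret(theta_hat, theta):
    """f(z*(theta_hat); theta) - f(z*(theta); theta) for the linear objective."""
    return float(theta @ solve(theta_hat) - theta @ solve(theta))

theta_true = np.array([3.0, 1.0, 2.0])  # true costs, unobserved at decision time
pred_a = np.array([2.0, 1.9, 5.0])      # large errors, but option 1 is still ranked cheapest
pred_b = np.array([3.0, 2.1, 2.0])      # one small error, but option 2 is now ranked cheapest

mse_a = float(np.mean((pred_a - theta_true) ** 2))
mse_b = float(np.mean((pred_b - theta_true) ** 2))
regret_a = regret(pred_a, theta_true)
regret_b = regret(pred_b, theta_true)
```

In this toy instance `pred_a` has the larger squared error yet zero regret, while `pred_b` is closer to the truth in MSE but induces a costlier decision, which is the misalignment that decision-focused training targets.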

2. Surrogate Losses and Decision-Focused Learning

Standard PTO implementations often train $f_\theta$ using MSE or generic regression/classification losses, which, while statistically consistent for prediction accuracy, are often misaligned with decision utility (Elmachtoub et al., 2017, Elmachtoub et al., 2020). This recognition led to the development of decision-focused surrogate losses that prioritize minimizing post-decision regret. Principal contributions in this area include:

SPO Loss and SPO$^+$ Surrogate: The Smart Predict-then-Optimize (SPO) loss (Elmachtoub et al., 2017) directly quantifies the excess cost from implementing $z^*(\hat\theta)$ under the true $\theta$. The SPO$^+$ surrogate is a convex upper bound shown to be Fisher-consistent for the regret in many settings and efficiently computable for polyhedral and convex feasible sets (Liu et al., 2021).
Simulation-based Evaluation: Simulators for model performance, particularly in classification-driven PTO, allow quantifying how confusion-matrix rates (TPR, FPR) propagate to decision gaps for complex optimization tasks (Smet, 2 Sep 2025). These tools reveal nonlinear, highly problem-specific tradeoffs between prediction error types and optimality gap.
End-to-End Training via KKT Differentiation: Approaches that embed the optimization problem in the learning loop differentiate through the argmin operator using the implicit function theorem and the KKT conditions, as in reinforcement learning or structured prediction settings (Wang et al., 2021). This enables gradient-based training focused directly on post-decision utility.
Efficient Global Losses (EGL): Departing from local, instance-wise surrogates, EGL designs global, feature-parameterized surrogate losses validated over a wider support of likely prediction errors, yielding both improved sample efficiency and robustness (Shah et al., 2023).
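
As an illustration of the first item above, the SPO and SPO$^+$ losses can be sketched for a linear objective $\theta^\top z$, where the surrogate takes the form $\ell_{\mathrm{SPO}^+}(\hat\theta, \theta) = \max_{z \in \mathcal{Z}} \{(\theta - 2\hat\theta)^\top z\} + 2\hat\theta^\top z^*(\theta) - \theta^\top z^*(\theta)$. The sketch below assumes a small, explicitly enumerated feasible set (a hypothetical "choose 2 of 4 items" problem) so that the argmin is a table lookup; the function names are illustrative, and a realistic problem would call an LP or combinatorial solver instead.

```python
from itertools import combinations
import numpy as np

# Hypothetical feasible set Z: all ways to choose 2 of 4 items, one row per decision.
Z = np.array([[1.0 if i in s else 0.0 for i in range(4)]
              for s in combinations(range(4), 2)])

def best(theta):
    """z*(theta): a minimizer of theta @ z over the rows of Z."""
    return Z[np.argmin(Z @ theta)]

def spo_loss(theta_hat, theta):
    """Regret of acting on theta_hat when the true cost vector is theta."""
    return float(theta @ best(theta_hat) - theta @ best(theta))

def spo_plus_loss(theta_hat, theta):
    """Convex surrogate in theta_hat that upper-bounds spo_loss."""
    z_star = best(theta)
    return float(np.max(Z @ (theta - 2.0 * theta_hat))
                 + 2.0 * theta_hat @ z_star - theta @ z_star)

def spo_plus_subgradient(theta_hat, theta):
    """One subgradient of spo_plus_loss with respect to theta_hat."""
    return 2.0 * (best(theta) - best(2.0 * theta_hat - theta))

theta = np.array([1.0, 4.0, 2.0, 3.0])      # true costs (illustrative)
theta_hat = np.array([3.0, 1.0, 2.5, 2.0])  # predicted costs (illustrative)
loss_spo = spo_loss(theta_hat, theta)
loss_spo_plus = spo_plus_loss(theta_hat, theta)
grad = spo_plus_subgradient(theta_hat, theta)
```

The subgradient needs only two extra solves of the same optimization problem, so a predictor can be trained with ordinary (sub)gradient methods without differentiating through the solver itself.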

3. PTO Beyond Single-Stage Settings: Sequential, Multi-Task, and Online Extensions

PTO has been extended well beyond classic, static optimization formulations:

Sequential Decision Problems: For MDPs with latent parameters, PTO trains predictors that map features $x$ to MDP parameters $\theta$, optimizing for generalization to new environments for which no trajectories are available. Gradient estimation is made scalable by sampling unbiased KKT gradients and by inverting low-rank Hessian approximations with the Woodbury identity (Wang et al., 2021).
Multi-Task and Multi-Instance PTO: When multiple related optimization tasks must be solved from a shared or partially shared feature space, multi-task PTO leverages deep architectures with inter-task parameter sharing and shared/individual task heads, combining losses across tasks for joint training (Tang et al., 2022).
Online Contextual and PAC-Bayesian PTO: In dynamic environments, sequential (online) PTO frameworks update predictors iteratively using feedback from downstream decision losses, incorporating principled Bayesian regret bounds and gradient-free sequential Monte Carlo (SMC) inference to accommodate nondifferentiable and structured optimization settings (Xie et al., 25 Nov 2025).
Black-Box and Partial-Feedback PTO: ESR (Empirical Soft Regret) is a differentiable surrogate for regret in black-box settings without complete reward/counterfactual data. ESR achieves model-class optimal regret in the limit (Tan et al., 2024).

4. Robustness, Calibration, and Stochastic PTO

The modularity of PTO lends itself to robust and risk-sensitive extensions:

Predict-Then-Calibrate: This two-phase paradigm decouples point prediction and robustification. An arbitrary (possibly black-box) predictor is complemented by a separate calibration phase that quantifies residual uncertainty (e.g., via quantile regression or conformal sets). The optimization phase then solves a robust or distributionally-robust problem with probabilistic or finite-sample coverage guarantees (Sun et al., 2023).
Scenario PTO (ScenPO): Instead of point estimates, scenario-based PTO generates data-driven distributions (scenarios) of future parameters via quantile neural networks, optimizing via stochastic programming to hedge demand or parameter risk. The scenario approach closes a significant portion of the optimality gap left by point forecast methods, enhancing service levels and fulfillment rates (Jia et al., 2024).
Chance-Constrained and Distributionally Robust PTO: In wildfire drone swarm management, for example, convex neural-network fire forecast models are embedded into a multi-stage robust mixed-integer program (MIP), with chance constraints reformulated tractably via second-order cone and cutting-plane methods (Pan et al., 2024).
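
The predict-then-calibrate idea in the first item above can be sketched as follows. This is a simplified illustration and not the procedure of Sun et al. (2023): it assumes a fixed stand-in predictor, split-conformal absolute-residual quantiles with a Bonferroni correction across coordinates to form a box uncertainty set, a linear cost, and a hypothetical one-hot decision set. All data are synthetic.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration data for 3 options; option 0 looks cheapest but is very noisy.
n_cal, d = 200, 3
mean_cost = np.array([2.0, 2.2, 2.1])
noise_sd = np.array([1.0, 0.1, 0.3])
theta_cal = rng.normal(loc=mean_cost, scale=noise_sd, size=(n_cal, d))
pred_cal = np.tile(mean_cost, (n_cal, 1))  # stand-in for an arbitrary black-box predictor

# Calibration phase: per-coordinate conformal radius at level 1 - alpha/d,
# using the finite-sample rank ceil((n + 1) * (1 - alpha/d)).
alpha = 0.1
residuals = np.abs(theta_cal - pred_cal)
k = math.ceil((n_cal + 1) * (1.0 - alpha / d))
assert k <= n_cal, "calibration set too small for the requested coverage level"
radius = np.sort(residuals, axis=0)[k - 1]

# Optimization phase: for nonnegative decisions and a linear cost, the worst case
# over the box [pred - radius, pred + radius] is attained at pred + radius.
Z = np.eye(d)              # hypothetical feasible set: pick exactly one option
pred_new = mean_cost       # point prediction for a new instance
nominal_choice = int(np.argmin(Z @ pred_new))
robust_choice = int(np.argmin(Z @ (pred_new + radius)))
```

Under exchangeability of calibration and test points, the box covers the true cost vector with probability at least $1 - \alpha$ by a union bound over coordinates. Here the nominal decision picks the noisy option 0, whereas the robust decision moves to the more predictable option 1.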

5. Theoretical Guarantees and Generalization Analysis

Considerable effort has been devoted to rigorously understanding the statistical and generalization properties of PTO (Balghiti et al., 2019, Liu et al., 2021):

For polyhedral feasible sets, generalization and uniform calibration bounds are proved in terms of the Natarajan dimension of the solution map induced by the hypothesis class and the structure of the feasible region.
Extensions to strongly convex feasible regions achieve linear calibration rates, so that fast convergence in the surrogate loss translates to fast convergence in true regret.
Margin-based analyses and vector-Rademacher complexities yield tight risk bounds even for nonconvex losses or when solution degeneracies are present.
Surrogates like SPO$^+$ have been shown to be Fisher-consistent for the regret for a broad class of problems.

6. PTO in Applications and Practical Design

PTO variants are widely applied, including but not limited to:

Resource scheduling and healthcare: Integration of LLM-derived features and unstructured data into probabilistic prediction of availability, followed by multi-objective MIPs for equitable and feasible roster construction (Jha et al., 2 Oct 2025).
Portfolio optimization: Incorporation of realistic trading frictions and risk constraints, demonstrating that decision-focused training, even with linear predictors, can robustly improve risk-adjusted performance in volatile or non-stationary real-world financial markets (Yi et al., 7 Jan 2026).
Personalized interventions and treatment: Uplift modeling with continuous treatments leverages PTO to estimate dose-response curves and optimally allocate resources via ILP under utility, cost, and fairness constraints (Vos et al., 2024).
Learning-to-Optimize (LtO) Proxies: As an alternative to differentiating through every optimization layer, neural proxies for the solution map (LtO) are trained directly from features to solutions, eliminating the need for stationary parameter estimation or KKT-based backpropagation. This has led to significant improvements in scalability and deployment speed under both convex and nonconvex optimization regimes (Kotary et al., 2023, Kotary et al., 2024).

7. Limitations, Open Problems, and Future Directions

Despite its wide adoption and substantial theoretical progress, several limitations and open questions remain:

PTO’s performance may degrade under large distribution shifts between the prediction and decision phases, or when prediction errors are heavily misaligned with decision-relevant directions (Smet, 2 Sep 2025, Wan et al., 5 Feb 2026).
Nonconvexity and discontinuity in regret losses complicate surrogate construction and sample complexity.
Scalability when embedding large combinatorial or sequential solvers in end-to-end training relies heavily on advances in sampling, differentiable proxy learning, and hardware resources.
Extensions to causal inference, multi-stage stochastic control, partial information feedback, and simultaneous prediction of multiple coupled parameter types remain important, active research directions (Xie et al., 25 Nov 2025, Tan et al., 2024).
Improved surrogate design for highly structured, combinatorial, or black-box optimization layers, as well as high-dimensional prediction domains, continues to be a major area of development.

In summary, the predict-then-optimize paradigm unifies predictive modeling with mathematical programming, enabling decision-making workflows that leverage machine learning for high-impact, structure-aware optimization under uncertainty. Substantial progress in surrogate loss theory, scalable differentiable solvers, robustification strategies, and multi-task and sequential extensions continues to expand the reach and rigor of PTO in both theoretical and practical domains (Elmachtoub et al., 2017, Wang et al., 2021, Sun et al., 2023, Kotary et al., 2023, Jia et al., 2024, Smet, 2 Sep 2025).