Predict-then-Optimize: Data-Driven Decisions

Updated 2 July 2026

Predict-then-Optimize is a two-stage framework that first predicts uncertain parameters from features and then solves an optimization problem using these estimates.
It emphasizes decision-focused learning by aligning standard prediction losses with the downstream impact on decision quality through surrogate losses like SPO+.
Recent advances extend PtO to complex domains such as mixed-integer programming, sequential decision-making, multi-task scenarios, and scalable real-world applications.

The predict-then-optimize (PtO) paradigm integrates statistical learning and mathematical optimization by first predicting the uncertain parameters of an optimization problem from features and then solving the problem using these predictions. It is a foundational methodology for data-driven decision-making under uncertainty, enabling end-to-end pipelines that exploit both predictive models and domain-specific optimization structure. Research in this area focuses on aligning the statistical training objective with decision quality, quantifying generalization and regret, developing tractable surrogate losses and scalable algorithms, and extending PtO to rich classes of optimization problems, including mixed-integer, sequential (MDP), multi-task, and real-world applications.

1. Core Predict-Then-Optimize Framework

The canonical PtO framework considers a two-stage pipeline:

Prediction: Learn a function $f: \mathcal X \to \Theta$ mapping observed features $x$ to unknown parameters $\theta$ of an optimization problem. Examples include predicting costs $c$, constraints, or transition/reward parameters for a Markov Decision Process (MDP).
Optimization: Solve the downstream optimization, substituting predicted parameters:

$\hat z = \underset{z \in \mathcal Z}{\arg\min}\, F(z, \hat \theta)\;.$

The principal evaluation metric is the decision-based regret or suboptimality:

$\operatorname{Regret}(f) = \mathbb E_{(x, \theta)}\left[ F(\hat z, \theta) - F(z^*, \theta) \right]\;, \quad\text{where } z^* = \arg\min_{z \in \mathcal Z} F(z, \theta)\;.$

Optimization problems include linear or integer programming, quadratic programming, and sequential decision problems (MDPs).

Most problem instantiations must address the misalignment between standard prediction losses (e.g., mean squared error for $f$) and the downstream impact on decision quality. This has motivated a research focus on decision-focused learning objectives and algorithmic pipelines that directly minimize regret (Elmachtoub et al., 2017, Balghiti et al., 2019, Liu et al., 2021).
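The misalignment can be seen in a toy problem. The sketch below is illustrative only: the two-item feasible set, the cost vectors, and the helper names (`solve`, `regret`, `mse`) are assumptions made for the example and do not come from the cited papers. Two predictions have the same mean squared error, yet one flips the ranking of the items and incurs regret while the other does not.

```python
# Toy decision: pick exactly one of two items so as to minimize the true cost.
FEASIBLE = [(1, 0), (0, 1)]  # hypothetical feasible set Z (one-hot choices)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(cost):
    """Return argmin over z in Z of cost^T z by enumeration."""
    return min(FEASIBLE, key=lambda z: dot(cost, z))

def regret(pred_cost, true_cost):
    """F(z_hat, theta) - F(z*, theta) for the linear objective F(z, c) = c^T z."""
    return dot(true_cost, solve(pred_cost)) - dot(true_cost, solve(true_cost))

def mse(pred_cost, true_cost):
    return sum((p - t) ** 2 for p, t in zip(pred_cost, true_cost)) / len(true_cost)

true_cost = (1.0, 2.0)
pred_a = (1.6, 1.4)  # errors (+0.6, -0.6): flips the ranking of the two items
pred_b = (0.4, 2.6)  # errors (-0.6, +0.6): preserves the ranking

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, "MSE:", round(mse(pred, true_cost), 2), "regret:", regret(pred, true_cost))
```

Both predictions are equally good under MSE, but only prediction A leads to a suboptimal decision, which is the gap that decision-focused objectives aim to close.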

2. Surrogate Losses and Fisher Consistency

Direct minimization of the regret objective (e.g., the Smart Predict-then-Optimize (SPO) loss) is generally intractable. For linear objectives,

$\ell_{\text{SPO}}(\hat c, c) = c^\top w^*(\hat c) - c^\top w^*(c)$

is nonconvex and discontinuous, where $w^*(c)$ denotes the optimal decision under cost $c$.

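A convex surrogate, the SPO+ loss, is derived by duality arguments:

$\ell_{\text{SPO+}}(\hat c, c) = \max_{w \in \mathcal Z}\left\{ c^\top w - 2\hat c^\top w \right\} + 2\hat c^\top w^*(c) - c^\top w^*(c)$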

(Elmachtoub et al., 2017, Liu et al., 2021). Under mild regularity conditions (a centrally symmetric conditional distribution of $c$ given $x$ and a unique minimizer), minimizing the SPO+ risk is Fisher consistent: the predictor that minimizes the surrogate risk also minimizes the regret.
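As a minimal sketch, both losses can be evaluated by enumeration when the feasible set is a small list of vertices. The SPO+ expression used here is the standard form from Elmachtoub et al., $\max_{w}\{c^\top w - 2\hat c^\top w\} + 2\hat c^\top w^*(c) - c^\top w^*(c)$; the three-vertex feasible set, the cost vectors, and the function names are assumptions chosen for illustration.

```python
# Hypothetical feasible set: choose exactly one of three items.
VERTICES = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def w_star(cost):
    """Optimal decision under a cost vector (ties broken by list order)."""
    return min(VERTICES, key=lambda w: dot(cost, w))

def spo_loss(c_hat, c):
    """Regret of acting on c_hat when the true cost is c."""
    return dot(c, w_star(c_hat)) - dot(c, w_star(c))

def spo_plus_loss(c_hat, c):
    """Convex surrogate: max_w {c^T w - 2 c_hat^T w} + 2 c_hat^T w*(c) - c^T w*(c)."""
    inner = max(dot(c, w) - 2 * dot(c_hat, w) for w in VERTICES)
    w_c = w_star(c)
    return inner + 2 * dot(c_hat, w_c) - dot(c, w_c)

c = (1.0, 2.0, 3.0)
c_hat = (2.5, 2.0, 3.0)  # overestimates item 0, so item 1 is chosen instead

print("SPO :", spo_loss(c_hat, c))       # regret of the induced decision
print("SPO+:", spo_plus_loss(c_hat, c))  # surrogate value, an upper bound on SPO
print("SPO+ at a perfect prediction:", spo_plus_loss(c, c))
```

In this instance the surrogate upper-bounds the regret and vanishes when the prediction equals the true cost, which is the behavior a usable surrogate needs.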

Further, the calibration function between excess surrogate risk and excess true risk depends on the geometry of the feasible set $\mathcal Z$, with linear-in-$\epsilon$ calibration for strongly convex $\mathcal Z$ and quadratic for bounded polyhedra (Liu et al., 2021).
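One common way to read such a statement is sketched below in generic notation that is assumed here rather than taken from the source: $R_{\text{SPO}}$ and $R_{\text{SPO+}}$ denote the true and surrogate risks, starred quantities denote their minimal values, and $\delta(\epsilon)$ is the calibration function.

$R_{\text{SPO+}}(f) - R^*_{\text{SPO+}} < \delta(\epsilon) \;\Longrightarrow\; R_{\text{SPO}}(f) - R^*_{\text{SPO}} < \epsilon\;.$

Under this reading, a linear calibration function, $\delta(\epsilon) \ge \kappa\,\epsilon$ for some constant $\kappa > 0$, bounds the excess regret by a constant multiple of the excess surrogate risk. A quadratic one, $\delta(\epsilon) \ge \kappa\,\epsilon^2$, gives only a square-root relationship:

$R_{\text{SPO}}(f) - R^*_{\text{SPO}} \le \sqrt{\left(R_{\text{SPO+}}(f) - R^*_{\text{SPO+}}\right)/\kappa}\;.$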

3. Generalization Guarantees and Margin Bounds

Generalization risk for decision losses is characterized by Rademacher- or combinatorial-complexity-based bounds (Balghiti et al., 2019). For the non-Lipschitz, nonconvex SPO loss, results include:

Natarajan dimension bounds: Relate the hypothesis class and the complexity of the feasible set $\mathcal Z$ (e.g., the number of extreme points).
Margin-based bounds: By defining a margin around the set of degenerate cost vectors (where $w^*(c)$ is non-unique), one constructs a Lipschitz-continuous “margin SPO loss” that allows for tighter generalization bounds. When the “strength property” holds (polyhedral or strongly convex $\mathcal Z$), the dependence on the number of extreme points can be removed and replaced by the (typically much smaller) multivariate Rademacher complexity of the function class (Balghiti et al., 2019).

The margin-based loss admits efficient computation for both polyhedral and strongly convex feasible sets, and the resulting generalization rate is expressed in terms of the function class dimension and the sample size.

4. Algorithmic Strategies: End-to-End and Solver-Free Approaches

Three main algorithmic categories have emerged:

End-to-End Differentiation: Embeds the optimization solver within the training loop so as to backpropagate gradients of surrogate decision-focused losses. For LPs and IPs, methods include differentiable relaxations, perturbation-based smoothings (Perturbed Fenchel-Young, DBB), and convex surrogates like SPO+ (Tang et al., 2022). For combinatorial or large-scale problems, specialized approaches such as Cone-aligned Vector Estimation (CaVE) align predicted cost vectors with the normal cone of the optimal solution, allowing fast QP-based training (Tang et al., 2023).
Solver-Free Surrogates: To address computation bottlenecks, solver-free surrogates such as the Weight-Integrated Spherical Error (WISE) loss exploit scale- and direction-invariance properties, enabling fully differentiable, regret-aligned training without solver calls (Wan et al., 17 Jun 2026).
Learning-to-Optimize Proxies: Rather than predict parameters, one learns a direct mapping from features to optimal solutions, training via supervised losses against pre-solved optima or KKT surrogates, thus sidestepping the need for solver calls in training altogether (“LtOF”) (Kotary et al., 2023, Kotary et al., 2024).
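To make the end-to-end category concrete, the following sketch trains a linear cost predictor by stochastic subgradient descent on the SPO+ loss, calling the solver inside the training loop. It relies on the known SPO+ subgradient with respect to the prediction, $2\,(w^*(c) - w^*(2\hat c - c))$ (Elmachtoub et al.). The two-item problem, the bias-free linear model, the dataset, the step size, and all function names are assumptions made for illustration, not a reference implementation of any cited toolkit.

```python
VERTICES = [(1, 0), (0, 1)]  # hypothetical feasible set: pick one of two items

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def w_star(cost):
    """Solver call: optimal decision under a cost vector, by enumeration."""
    return min(VERTICES, key=lambda w: dot(cost, w))

def predict(weights, x):
    """Bias-free linear model with scalar feature x: c_hat_j = weights[j] * x."""
    return tuple(wj * x for wj in weights)

def spo_plus_subgradient(c_hat, c):
    """Subgradient of the SPO+ loss w.r.t. c_hat: 2 * (w*(c) - w*(2 c_hat - c))."""
    shifted = tuple(2 * ch - ci for ch, ci in zip(c_hat, c))
    return tuple(2 * (a - b) for a, b in zip(w_star(c), w_star(shifted)))

def average_regret(weights, data):
    total = 0.0
    for x, c in data:
        total += dot(c, w_star(predict(weights, x))) - dot(c, w_star(c))
    return total / len(data)

def train(data, lr=0.1, epochs=20):
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for x, c in data:
            grad_c_hat = spo_plus_subgradient(predict(weights, x), c)
            # Chain rule: d c_hat_j / d weights_j = x.
            weights = [wj - lr * g * x for wj, g in zip(weights, grad_c_hat)]
    return weights

# Illustrative data: (feature, true cost vector); the cheaper item depends on the sign of x.
DATA = [(1.0, (2.0, 0.0)), (-1.0, (0.0, 2.0))]

initial = average_regret([0.0, 0.0], DATA)
learned = train(DATA)
final = average_regret(learned, DATA)
print("average regret before:", initial, "after:", final)
```

Each update requires two solver calls per sample (one of which depends only on the true cost and could be cached), which is the computational bottleneck that solver-free surrogates and learning-to-optimize proxies try to avoid.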

For sequential decision problems (MDPs), decision-focused PtO methods embed RL planning solvers (value iteration, policy gradients) as differentiable layers in the outer training loop, using sampling-based unbiased estimators and low-rank approximations to handle intractable KKT derivatives (Wang et al., 2021).

5. Extensions: Dependent Data, Multi-Task, and Real-World Applications

Dependent Data: Autoregressive and time series extensions of PtO establish generalization and calibration results under stationary mixing processes, showing that the dependence structure affects regret bounds predominantly via the mixing coefficients (Liu et al., 2024). Empirically, decision-oriented (SPO+) training achieves lower regret than standard prediction losses under dependencies induced by dynamical systems.
Multi-Task & Multi-Stage: When multiple optimization tasks (possibly with differing feasible regions) share features, multi-task architectures leverage shared representations and decision-focused losses to transfer information and improve sample efficiency (Tang et al., 2022). Generalization and calibration results extend, provided appropriate complexities are controlled.
High-Impact Domains: PtO frameworks have been deployed for revenue-maximizing fund allocation under constraints (Tang et al., 5 Mar 2025), uplift modeling with continuous treatments (Vos et al., 2024), robust drone-swarm wildfire fighting (Pan et al., 2024), and LLM-enhanced clinician scheduling (Jha et al., 2 Oct 2025). Each adopts domain-specific prediction targets and optimization models, incorporating operational constraints, fairness, and risk.

6. Empirical Insights, Heuristics, and Best Practices

End-to-End Decision-Focused Learning almost always improves out-of-distribution decision quality relative to naïve two-stage predictors, especially when predictive models are misspecified or under limited data (Shah et al., 2023, Elmachtoub et al., 2017, Tang et al., 2022, Tang et al., 2023).
Surrogate Loss Selection: SPO+ and WISE are Fisher-consistent surrogates in standard linear and combinatorial settings; in binary/integer cases, CaVE achieves favorable speed-quality trade-offs by leveraging cone geometry (Tang et al., 2023). In highly nonlinear or nonconvex settings, learning-to-optimize proxies are both efficient and competitive (Kotary et al., 2023, Kotary et al., 2024).
Sample Complexity: Feature-based parameterizations and model-based sampling (e.g., Efficient Global Losses, EGL (Shah et al., 2023)) can dramatically reduce the data needed to fit effective loss surrogates, especially when prediction errors are globally dispersed rather than locally Gaussian.
Scalability: Heuristic solution algorithms (e.g., for combinatorial allocation) can yield near-optimal performance at a fraction of the time cost of exact solvers (Tang et al., 5 Mar 2025), and Benders decomposition or robust convexification strategies maintain tractability under high-dimensional or uncertain environments (Pan et al., 2024).
Generalization under Model Misspecification: Decision-aligned surrogate losses (SPO+, WISE, decision-focused proxies) exhibit pronounced performance advantages in the presence of strong nonlinearities, misspecification, and limited samples.

7. Outlook and Research Frontiers

Ongoing challenges and opportunities in the field include:

Theoretical Generalization under Non-i.i.d. Settings: Sharper guarantees for regret and calibration when data are dependent, arising from dynamical systems or temporal structure (Liu et al., 2024).
End-to-End Uplift and Fairness: Extension of PtO to fairness-constrained and uplift-modeling contexts with continuous actions, where both predictive and prescriptive modeling must be sensitive to equity and multi-group utility trade-offs (Vos et al., 2024).
Solver-Free and Black-Box Methods: Fully eliminating the need for solver calls in training (beyond test-time), especially for large or combinatorial problems, remains a focus. WISE loss (Wan et al., 17 Jun 2026) and learning-to-optimize proxies are prominent examples.
Adaptive Post-Estimation Correction: When prediction errors induce asymmetric regret, small, closed-form post-estimation corrections (based on curvature information) can yield uniformly lower regret, especially in small-sample regimes (Albert et al., 28 Jul 2025).
Scalable Application to Industry: Modern toolkits (e.g., PyEPO (Tang et al., 2022)) and heuristic solvers unlock PtO formulations at the million-variable scale, supporting applications in recommendation, resource allocation, and scheduling.

The PtO paradigm provides a flexible and theoretically grounded bridge between predictive modeling and operational decision-making, supported by a rapidly growing body of theory and large-scale applications.