End-to-End PO Learning

Updated 27 April 2026

End-to-end PO Learning is an approach that integrates predictive modeling with optimization by training models with gradient signals from downstream decision losses.
It employs techniques like implicit differentiation via KKT conditions, quadratic smoothing, and differentiable optimization layers to streamline gradient computations.
Empirical studies in domains such as energy dispatch and inventory management show significant cost reductions and performance improvements over traditional methods.

End-to-end Predict-then-Optimize (PO) Learning is a paradigm in machine learning and optimization where predictive models are trained not only for statistical accuracy but to directly improve the quality of downstream decisions obtained by solving an optimization problem whose parameters are themselves outputs of the learned model. Rather than following the conventional two-stage pipeline of first training a predictor separately (e.g., via maximum likelihood or mean squared error) and then optimizing with its outputs, end-to-end PO learning "closes the loop," allowing gradient or policy signals to flow from the final task loss back through both the optimization procedure and the prediction architecture. This approach has been formalized and analyzed in diverse domains, including stochastic programming, robust dispatch in energy systems, speech recognition, and decision-focused resource allocation (Donti et al., 2017, Cameron et al., 2021, Rychener et al., 2023, Lu et al., 2020, Zhang et al., 2022).

1. Mathematical Foundations and Formalism

End-to-end PO learning considers problems in which one predicts uncertain or unknown parameters $\theta$ (or distributions $P_\theta$) from observed input features $z$, and the ultimate goal is to make a downstream decision $x^*(\theta)$ by solving an optimization problem: $x^*(\theta) = \arg\min_{x \in \mathcal{X}} f(x;\theta)$, where $f(x;\theta)$ is the task loss or cost. Standard approaches train the predictor $M_\phi(z)$ by minimizing a statistical loss (e.g., $L_\mathrm{pred}$), ignoring the structure of $f$. In contrast, the end-to-end PO approach defines the true loss as the downstream loss obtained by solving for $x^*(\hat\theta)$ at the predicted parameters $\hat\theta = M_\phi(z)$ and evaluating $f(x^*(\hat\theta);\theta)$ under the ground-truth parameters.

Formally, the objective is: $\min_\phi \mathbb{E}_{(z,\theta)}\left[f\left(x^*(M_\phi(z));\theta\right)\right]$, where $x^*(\cdot)$ may itself be the solution of a (parametric) conic program, quadratic program, linear program, or combinatorial problem.

A key technical challenge is the need to differentiate through $x^*(\cdot)$, i.e., to backpropagate the outer loss gradient through the argmin solution of an optimization problem (Donti et al., 2017, Rychener et al., 2023).
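
The gradient in question can be written out with the chain rule. The display below is a sketch, assuming that $f$ is differentiable in $x$ and that the solution map is differentiable at the predicted point; the shorthand $\hat\theta = M_\phi(z)$ for the prediction is introduced here for illustration:

$$
\nabla_\phi\, f\big(x^*(\hat\theta);\theta\big) = \left(\frac{\partial M_\phi(z)}{\partial \phi}\right)^{\top} \left(\frac{\partial x^*(\hat\theta)}{\partial \hat\theta}\right)^{\top} \nabla_x f\big(x^*(\hat\theta);\theta\big), \qquad \hat\theta = M_\phi(z).
$$

The first factor is ordinary backpropagation through the predictor and the last is the gradient of the task loss at the chosen decision. Only the middle Jacobian requires differentiating the argmin.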

2. Algorithmic and Computational Techniques

Several algorithmic strategies have been developed to enable end-to-end PO learning:

Implicit Differentiation via KKT Conditions: When the inner optimization is convex and strongly regular, the solution map $\theta \mapsto x^*(\theta)$ is differentiable almost everywhere. Using the Karush–Kuhn–Tucker (KKT) conditions, one can derive closed-form or linear-system expressions for the Jacobian $\partial x^*/\partial \theta$ (Donti et al., 2017, Rychener et al., 2023, Lu et al., 2020, Zhang et al., 2022).
Quadratic or Envelope Smoothing: For non-smooth or piecewise-linear objectives, quadratic regularization is added (e.g., a term of the form $\frac{\epsilon}{2}\|x\|_2^2$) to enforce strong convexity and enable implicit differentiation. Moreau envelopes are also used to smooth non-differentiable objectives (Dinh et al., 2024).
Surrogate and Differentiable Proxies: When $f$ includes non-differentiable penalties (e.g., piecewise costs, Ordered Weighted Averaging), surrogate loss functions or smooth risk proxies are used to facilitate backpropagation (Lu et al., 2020, Dinh et al., 2024).
Differentiable Optimization Layers: Implementations leverage differentiable QP/LP solvers (e.g., OptNet, CVXPYLayers, custom Lagrange-based solvers) to handle the solution and gradient computation automatically within the computational graph (Donti et al., 2017, Zhang et al., 2022, Dinh et al., 2024).
Hybrid Losses: In some frameworks (SimPO, AIPO), a weighted sum of statistical and task-driven losses is minimized, interpolating between predict-then-optimize (two-stage) and fully task-based end-to-end optimization (Zhang et al., 2022, Shen et al., 2024).
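
As an illustration of the first two strategies, the following minimal sketch differentiates the solution of a small quadratically smoothed problem through its KKT system. It assumes the toy problem $\min_x \theta^\top x + \frac{\rho}{2}\|x\|_2^2$ subject to $\sum_i x_i = 1$, with no inequality constraints so that the KKT conditions reduce to one linear system. The problem itself and the names `solve_linear`, `kkt_matrix`, `solve_qp`, `jacobian`, and `rho` are hypothetical choices for this example and are not taken from the cited works.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def kkt_matrix(rho, n):
    """KKT matrix [[rho*I, 1], [1^T, 0]] of the assumed toy problem."""
    K = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        K[i][i] = rho
        K[i][n] = 1.0
        K[n][i] = 1.0
    return K


def solve_qp(theta, rho):
    """Forward pass: x*(theta) from the KKT system rho*x + nu*1 = -theta, sum(x) = 1."""
    n = len(theta)
    rhs = [-t for t in theta] + [1.0]
    return solve_linear(kkt_matrix(rho, n), rhs)[:n]


def jacobian(theta, rho):
    """Backward pass: differentiate the KKT system with respect to theta.
    Column j solves K [dx; dnu] = [-e_j; 0]. The result does not depend on the
    value of theta here because this toy solution map is affine in theta."""
    n = len(theta)
    K = kkt_matrix(rho, n)
    cols = []
    for j in range(n):
        rhs = [0.0] * (n + 1)
        rhs[j] = -1.0
        cols.append(solve_linear(K, rhs)[:n])
    # cols[j][i] holds dx_i/dtheta_j; transpose so that J[i][j] = dx_i/dtheta_j
    return [[cols[j][i] for j in range(n)] for i in range(n)]


theta, rho = [1.0, 3.0], 2.0
x_star = solve_qp(theta, rho)
J = jacobian(theta, rho)
print(x_star, J)
```

The same matrix serves both the forward solve and the Jacobian computation, which is the reason implicit differentiation adds little cost once the KKT system has been factorized. With inequality constraints the system would also involve the active set or the complementarity conditions, which this sketch leaves out.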

3. Theoretical Analysis: Performance Gaps and Guarantees

End-to-end PO learning is theoretically justified by showing that direct optimization of the downstream task loss can dominate the two-stage (statistically optimal, but task-agnostic) approach whenever the task cost penalizes prediction errors unequally or depends on multiple correlated predictions (Cameron et al., 2021).

Price of Correlation (POC): The performance gap between two-stage and end-to-end approaches can be quantified using the POC in stochastic optimization. In scenarios where cost coefficients are nonlinear functions (e.g., products) of predicted random variables, the two-stage approach can be arbitrarily suboptimal, because predicting each variable separately ignores their correlation and yields incorrect objective coefficients, while end-to-end methods adaptively trade off prediction errors to minimize the realized cost (Cameron et al., 2021).
Convexity, Uniqueness, and Stability: When the inner optimization is convex and the cost function is continuous piecewise-linear (e.g., economic dispatch), the solution mapping is unique and gradients are well-behaved, ensuring stable optimization and convergence of end-to-end training (Lu et al., 2020).
Bayesian Perspective: The standard end-to-end PO algorithm has a Bayesian interpretation: it learns a parametric approximation to the posterior Bayes action map $z \mapsto \arg\min_{x \in \mathcal{X}} \mathbb{E}[f(x;\theta) \mid z]$ (Rychener et al., 2023).
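
The effect of correlation on a product-form cost coefficient can be seen in a toy calculation. The sketch below is an invented two-scenario example, not one taken from Cameron et al.: a "risky" option costs $X \cdot Y$ with $X = Y$ in every scenario, and a "fixed" option costs 1.5. All names and numbers are hypothetical.

```python
# Two equally likely scenarios in which X and Y are perfectly correlated.
scenarios = [((0.0, 0.0), 0.5), ((2.0, 2.0), 0.5)]
fixed_cost = 1.5

# Two-stage: predict each variable by its mean, then plug the product into the objective.
mean_x = sum(p * x for (x, _), p in scenarios)
mean_y = sum(p * y for (_, y), p in scenarios)
plug_in_cost = mean_x * mean_y

# True expected cost of the risky option, which accounts for the correlation.
true_cost = sum(p * x * y for (x, y), p in scenarios)


def realized(choice):
    """Expected cost actually incurred by a choice."""
    return true_cost if choice == "risky" else fixed_cost


two_stage_choice = "risky" if plug_in_cost < fixed_cost else "fixed"
decision_aware_choice = "risky" if true_cost < fixed_cost else "fixed"

print(two_stage_choice, realized(two_stage_choice))
print(decision_aware_choice, realized(decision_aware_choice))
```

Here both marginal predictions are exactly right, yet the plug-in coefficient understates the expected cost of the risky option, so the two-stage decision is the more expensive one. Scaling the scenario values widens the gap, which is the sense in which the two-stage approach can be made arbitrarily suboptimal.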

4. Application Domains

End-to-end PO learning has been empirically validated in diverse real-world settings:

Domain	Optimization Layer	Empirical Benefits
Power system dispatch	LP/QP with piecewise-linear cost, constraints	3–5% reduction in dispatch cost; robustness to error distributions; 182% faster training with dedicated kernel (Lu et al., 2020)
Inventory "newsvendor"	QP; stochastic demands	47% lower task cost vs. MLE linear baseline under misspecification; closely matches ideal policy when model is correct (Donti et al., 2017)
Grid scheduling, energy storage	Stochastic or robust QP with ramp/storage	38% lower task cost vs. RMSE forecasting on grid scheduling; up to 102% better profit in energy storage (Donti et al., 2017)
Multi-objective/fairness	OWA-layer with robust optimization	30–50% regret improvement on robust portfolio; 10–20% worst-case path-length improvements in multi-species routing (Dinh et al., 2024)
LLM preference optimization	Direct Preference Optimization (DPO), MaPPO	State-of-the-art win rates; controls failure modes in iterative PO (length exploitation, overfitting) (Shen et al., 2024, Lan et al., 27 Jul 2025)
Speech recognition	REINFORCE/SCST on WER, policy gradient	4–13.8% relative WER gain vs. maximum likelihood CTC baseline (Zhou et al., 2017)

The technique is broadly applicable wherever task objectives are not aligned with prediction accuracy, for example in learning-to-rank, resource allocation, robust path planning, and LLM alignment (Dinh et al., 2024, Lan et al., 27 Jul 2025).

5. End-to-End Training Workflow and Practical Considerations

A generic end-to-end PO workflow involves:

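Define a parameterized predictor $M_\phi(z)$ of the problem parameters $\theta$, with trainable weights $\phi$.
For each example, solve $x^*(M_\phi(z)) = \arg\min_{x \in \mathcal{X}} f(x; M_\phi(z))$ using a differentiable solver.
Evaluate the true loss $f(x^*(M_\phi(z)); \theta)$ using ground-truth parameters.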
Compute gradients by differentiating through the solver (via KKT, autodiff, or dedicated backward routines).
Update predictor parameters via standard SGD or Adam.
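
The five steps above can be sketched on a one-dimensional toy problem. The example assumes a linear predictor, an inner problem that projects the prediction onto a capacity interval (a tiny QP whose solution and derivative are available in closed form), and a newsvendor-style task cost. Full-batch subgradient descent with a decaying step size stands in for SGD or Adam. The data, cost coefficients, and all function names are hypothetical.

```python
def solve_inner(theta_hat, cap):
    """Inner problem: argmin_x 0.5*(x - theta_hat)^2 s.t. 0 <= x <= cap.
    Returns the solution and dx*/dtheta_hat (1 if no bound is active, else 0)."""
    x = min(max(theta_hat, 0.0), cap)
    dx_dtheta = 1.0 if 0.0 < theta_hat < cap else 0.0
    return x, dx_dtheta


def task_loss(x, theta, c_under=4.0, c_over=1.0):
    """Newsvendor-style cost evaluated with the ground-truth demand theta."""
    return c_under * max(theta - x, 0.0) + c_over * max(x - theta, 0.0)


def task_grad_x(x, theta, c_under=4.0, c_over=1.0):
    """A subgradient of task_loss with respect to the decision x."""
    if x < theta:
        return -c_under
    if x > theta:
        return c_over
    return 0.0


def mean_task_loss(w, b, data, cap):
    """Average task loss of the decisions induced by the predictor w*z + b."""
    return sum(task_loss(solve_inner(w * z + b, cap)[0], th) for z, th in data) / len(data)


def train(data, cap=3.0, steps=400):
    w, b = 0.0, 0.5  # step 1: linear predictor theta_hat = w*z + b
    for t in range(steps):
        gw = gb = 0.0
        for z, theta in data:
            theta_hat = w * z + b
            x, dx_dtheta = solve_inner(theta_hat, cap)  # step 2: solve the inner problem
            g = task_grad_x(x, theta) * dx_dtheta       # steps 3-4: task loss gradient through the solver
            gw += g * z / len(data)
            gb += g / len(data)
        lr = 0.1 / (1.0 + 0.05 * t)                     # decaying step size (an assumption)
        w, b = w - lr * gw, b - lr * gb                 # step 5: update predictor weights
    return w, b


# Synthetic data: demand is 2*z plus an alternating +/-0.2 disturbance.
data = [(0.1 * k, 2 * 0.1 * k + (0.2 if k % 2 else -0.2)) for k in range(1, 11)]
loss_before = mean_task_loss(0.0, 0.5, data, 3.0)
w, b = train(data)
loss_after = mean_task_loss(w, b, data, 3.0)
print(loss_before, loss_after)
```

Because under-ordering is penalized four times as heavily as over-ordering in this toy cost, the trained predictor is pushed toward the upper range of the demand rather than its mean, which a squared-error fit would not do. Note also that the derivative of the inner solution is zero when a bound is active, so a predictor initialized outside the feasible interval would receive no gradient. That is one practical reason smoothing is often added.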

Complexity considerations:

The cost of backpropagating through an optimization layer scales as $O(n^3)$ in the number of variables and constraints for dense QPs, and is lower for certain structured problems (for example, dispatch with piecewise-linear cost).
Approximate or surrogate gradients may be required for combinatorial or non-convex problems (Dinh et al., 2024).
Smoothing or regularization is often used to avoid non-differentiabilities.
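
As a concrete instance of smoothing, the Moreau envelope of the absolute value with parameter $\mu > 0$ is the Huber function, which is differentiable everywhere. The sketch below implements that one-dimensional case only; the function names and the parameter name `mu` are illustrative choices.

```python
def moreau_abs(x, mu):
    """Moreau envelope of |x| with parameter mu > 0, i.e. min_y |y| + (x - y)^2 / (2*mu).
    This equals the Huber function."""
    if abs(x) <= mu:
        return x * x / (2.0 * mu)
    return abs(x) - mu / 2.0


def moreau_abs_grad(x, mu):
    """Gradient of the envelope: x/mu clipped to [-1, 1]. Defined at x = 0, unlike |x|."""
    return max(-1.0, min(1.0, x / mu))


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, moreau_abs(x, 1.0), moreau_abs_grad(x, 1.0))
```

Smaller values of `mu` track the original function more closely but make the gradient change more abruptly near the kink, so the parameter trades approximation error against conditioning.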

6. Extensions and Variants

Recent work extends the PO framework:

Iterative and Preference-Based Optimization: In LLM alignment, Preference Optimization (PO) methods such as DPO, AIPO, and MaPPO enable efficient preference learning and combat pathologies such as length exploitation and overfitting by incorporating agreement-aware or MAP-based margins (Shen et al., 2024, Lan et al., 27 Jul 2025).
Curriculum and Adaptive PO: Curriculum-guided policy optimization dynamically adjusts the difficulty of task instances, creating an online feedback loop that broadens coverage and improves reasoning at scale (Zhang et al., 29 Sep 2025).
Distributionally Robust PO: DRO-based decision maps internalize ambiguity sets during training, improving tail performance in data-shift or model-misspecification regimes (Rychener et al., 2023).
Hybrid Statistical-Task Losses: SimPO combines prediction and optimization losses to interpolate between classic supervised learning and decision-focused learning (Zhang et al., 2022).
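
For the preference-based variants, the commonly stated pairwise DPO loss gives a sense of the objective involved. The sketch below computes it for a single preference pair from sequence log-probabilities. The function and argument names and the default `beta` are hypothetical, and the agreement-aware and MAP-based modifications of AIPO and MaPPO are not shown because their exact form is not given here.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)]),
    where each log-ratio compares the policy with a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a numerically stable way
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: zero margin.
neutral = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the chosen response relative to the reference: lower loss.
improved = dpo_pair_loss(-10.0, -15.0, -12.0, -15.0)
print(neutral, improved)
```

The loss depends only on how the policy shifts the two responses relative to the reference model, not on their absolute log-probabilities.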

7. Limitations, Challenges, and Open Directions

End-to-end PO learning is most beneficial when at least one of the following holds: the statistical prediction loss is misaligned with the downstream cost; the cost function aggregates multiple correlated or nonlinear functionals of the predictions; or fairness or robustness requirements are central to the application (Cameron et al., 2021, Dinh et al., 2024).

Limitations include:

Computational cost of inner-loop optimization and gradient computation for each training example.
Applicability is largely limited to convex and differentiable inner problems; extensions to nonconvex and combinatorial settings remain challenging.
When predictors are well-specified and the statistical loss is task-aligned (e.g., the cost is truly minimized when the model outputs conditional expectations), there may be no advantage over classic two-stage methods.
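
The last point can be made concrete for a cost that is linear in the uncertain parameters. As a short worked case, assume $f(x;\theta) = \theta^\top x$ and a feasible set $\mathcal{X}$ that does not depend on $\theta$. Then, by linearity of expectation,

$$
\mathbb{E}\big[f(x;\theta) \mid z\big] = \mathbb{E}[\theta \mid z]^\top x
\quad\Longrightarrow\quad
\arg\min_{x \in \mathcal{X}} \mathbb{E}\big[f(x;\theta) \mid z\big] = x^*\big(\mathbb{E}[\theta \mid z]\big).
$$

Under these assumptions a predictor that recovers the conditional mean exactly already induces the optimal decision, so the two-stage pipeline loses nothing. Differences arise only when the predictor is misspecified or the cost is nonlinear in $\theta$.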

Recent work suggests further directions: scalable and efficient solver integration, adaptive curriculum feedback, robustification (DRO), and advanced preference optimization objectives for LLM and complex human-in-the-loop applications (Shen et al., 2024, Lan et al., 27 Jul 2025, Zhang et al., 29 Sep 2025).