Differentiable Quadratic Program (dQP)

Updated 14 December 2025
  • dQP is a mathematical programming primitive embedding quadratic programming as differentiable layers, enabling end-to-end optimization.
  • It employs first-order splitting methods such as ADMM, yielding forward and backward passes that are more efficient than traditional interior-point approaches.
  • dQP finds applications in control, finance, and image processing, significantly enhancing computational performance in learning-based systems.

A differentiable quadratic program (dQP) is a mathematical programming primitive that allows the solution of a quadratic program to function as a differentiable layer within end-to-end optimization pipelines, particularly neural networks. dQP refers specifically to the tight integration of QP solving and differentiation, such that both the forward pass (solution computation) and the backward pass (sensitivity/gradient computation) are handled efficiently and scale to medium- and large-scale instances, typically by leveraging first-order or splitting methods rather than classical interior-point schemes. This paradigm enables the direct embedding of constrained optimization as trainable modules, a central component for applications in control, financial engineering, and learning-based prediction-optimization.

1. Quadratic Program Layer: Canonical Formulation

dQP layers operate on the canonical parametric convex quadratic program:

$$\min_{x\in\mathbb{R}^n} \tfrac{1}{2}x^\top P x + q^\top x \quad \text{subject to} \quad Ax = b,\quad Gx \le h,$$

where $x$ is the decision variable, $P\in\mathbb{R}^{n\times n}$ (positive semidefinite), $q\in\mathbb{R}^n$, $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^m$, $G\in\mathbb{R}^{p\times n}$, $h\in\mathbb{R}^p$. For convexity and well-posedness, $P\succeq 0$ and the feasible set is assumed nonempty (Butler et al., 2021). Many extensions absorb equalities and inequalities into bounded forms ($\ell \le Ax \le u$), introduce auxiliary slack variables, or specialize to conic quadratic and quadratic cone programs (QCPs) (Healey et al., 24 Aug 2025).

In neural network architectures, the problem data $\theta$ (e.g., $P$, $q$, $A$, $b$, $G$, $h$) can be generated by earlier layers, and the mapping $x^\star(\theta)$ serves as a nontrivial, constrained activation or projection function.
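As a concrete illustration, the sketch below builds such a layer with the cvxpylayers package, one of several possible implementations. The dimensions and random data are illustrative only; the quadratic term is parametrized through a square root $P^{1/2}$ to satisfy the package's disciplined-parametrized-programming rules, and the constraint data are constructed to be feasible at a reference point.

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n, m, p = 10, 3, 5
x = cp.Variable(n)
# Parametrize P via its square root so the objective is DPP-compliant.
P_sqrt = cp.Parameter((n, n))
q = cp.Parameter(n)
A, b = cp.Parameter((m, n)), cp.Parameter(m)
G, h = cp.Parameter((p, n)), cp.Parameter(p)
prob = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(P_sqrt @ x) + q @ x),
    [A @ x == b, G @ x <= h],
)
layer = CvxpyLayer(prob, parameters=[P_sqrt, q, A, b, G, h], variables=[x])

# Illustrative feasible data: constraints built to hold at some point x0.
torch.manual_seed(0)
P_sqrt_t = torch.randn(n, n, requires_grad=True)
q_t = torch.randn(n, requires_grad=True)
A_t, G_t, x0 = torch.randn(m, n), torch.randn(p, n), torch.randn(n)
b_t, h_t = A_t @ x0, G_t @ x0 + 1.0

x_star, = layer(P_sqrt_t, q_t, A_t, b_t, G_t, h_t)
x_star.sum().backward()  # gradients flow through the QP to P_sqrt_t and q_t
```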

2. Forward Pass: First-Order Splitting via ADMM

dQP layers predominantly eschew dense KKT or interior-point solves in favor of first-order operator-splitting techniques, notably the Alternating Direction Method of Multipliers (ADMM) (Butler et al., 2021, Butler, 2023). A prototypical ADMM-based dQP layer reformulates the QP as a variable-split constrained minimization:

$$\min_{x,z}\; f(x) + g(z) \quad \text{subject to} \quad x - z = 0,$$

with $f(x) = \frac{1}{2}x^\top P x + q^\top x + \mathbb{I}_{Ax=b}(x)$ and $g(z) = \mathbb{I}_{Gz \le h}(z)$, where $\mathbb{I}$ denotes the indicator function of the constraint set. The scaled-dual ADMM scheme performs alternating updates:

  • $x^{k+1}$: equality-constrained QP via KKT solve
  • $z^{k+1}$: projection onto the feasible polyhedron (often implemented as per-coordinate clamping)
  • $u^{k+1}$: dual variable update

This splitting confines the dominant computational cost to sparse matrix factorizations of size $(n+m)\times(n+m)$, well below the cubic scaling of the full KKT system of size $(n+m+p)\times(n+m+p)$. Methods such as SCQPTH further exploit data scaling, adaptive penalty selection, and reuse of factorizations to support $n = 100$–$1000$ and $m$ in the thousands, with routine use of batched GPU accelerators (Butler, 2023). Empirically, ADMM-based dQP layers achieve 1–10$\times$ improvements in solve time over interior-point-based OptNet or conic-layer CVXPY implementations at comparable accuracy, even on dense, large-scale QPs (Butler et al., 2021, Butler, 2023).
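The following minimal NumPy sketch illustrates the structure of such an iteration on the bounded form $\ell \le Cx \le u$ mentioned in Section 1 (equalities can be encoded as rows with $\ell_i = u_i = b_i$). With the consensus split $z = Cx$, the $z$-update is exactly per-coordinate clamping. A fixed penalty $\rho$ is assumed, with no termination test or residual balancing:

```python
import numpy as np

def qp_admm(P, q, C, l, u, rho=1.0, iters=500):
    """ADMM for min 0.5 x'Px + q'x  s.t.  l <= Cx <= u, via the split z = Cx.
    Minimal sketch: fixed rho; assumes P + rho*C'C is positive definite."""
    n, m = P.shape[0], C.shape[0]
    x, z, w = np.zeros(n), np.zeros(m), np.zeros(m)  # w is the scaled dual
    # Factor (P + rho C'C) once; every x-update reuses this factorization.
    L = np.linalg.cholesky(P + rho * C.T @ C)
    solve = lambda r: np.linalg.solve(L.T, np.linalg.solve(L, r))
    for _ in range(iters):
        x = solve(-q + rho * C.T @ (z - w))   # x-update: one linear solve
        z = np.clip(C @ x + w, l, u)          # z-update: coordinatewise clamp
        w += C @ x - z                        # scaled dual update
    return x

# Illustrative use on a strictly convex random instance:
rng = np.random.default_rng(0)
M, C = rng.standard_normal((8, 8)), rng.standard_normal((4, 8))
P, q = M.T @ M + 1e-2 * np.eye(8), rng.standard_normal(8)
x = qp_admm(P, q, C, l=-np.ones(4), u=np.ones(4))
```

The single cached factorization is what keeps the per-iteration cost at one back-substitution plus cheap projections, in line with the scaling discussion above.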

3. Differentiation: Implicit Backward Sensitivity

A core attribute of dQP is the ability to efficiently compute the derivatives $\partial x^\star/\partial\theta$ for backpropagation. ADMM-based approaches express their fixed-point iteration as $v^\star = F(v^\star; \theta)$, where $v$ collects the primal and scaled dual variables. Differentiation proceeds via the implicit function theorem:

$$\frac{\partial v^\star}{\partial \theta} = \left[I - \frac{\partial F}{\partial v}\bigg|_{v^\star}\right]^{-1} \frac{\partial F}{\partial \theta}\bigg|_{v^\star}$$

(Butler et al., 2021, Butler, 2023). This mechanism requires solving a linear system of size $(n+m)$ in the backward pass, independent of the iteration count. It contrasts with unrolled differentiation, which stores all intermediate iterates ($O(Kn)$ memory for $K$ iterations), and with KKT-based implicit differentiation, which necessitates a much larger $(n+m+p)\times(n+m+p)$ system (Butler et al., 2021).
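A generic sketch of this mechanism using PyTorch's Jacobian utilities follows; F, v_star, and theta are placeholders for an iteration map, its fixed point, and the problem parameters, not any specific paper's API, and F is assumed smooth at the fixed point (i.e., the active set is locally stable):

```python
import torch

def implicit_sensitivity(F, v_star, theta):
    """dv*/dtheta at a fixed point v* = F(v*, theta) via the implicit
    function theorem: one linear solve of size len(v*), independent of
    how many forward iterations were run."""
    J_v = torch.autograd.functional.jacobian(lambda v: F(v, theta), v_star)
    J_t = torch.autograd.functional.jacobian(lambda t: F(v_star, t), theta)
    I = torch.eye(v_star.numel(), dtype=v_star.dtype)
    # Solve (I - dF/dv) dv*/dtheta = dF/dtheta at the fixed point.
    return torch.linalg.solve(I - J_v, J_t)
```

In a practical backward pass one would solve a single adjoint system against the incoming gradient (a vector-Jacobian product) rather than forming full Jacobians; the explicit form above is for clarity only.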

Specialized structured problems (e.g., graph-cut QPs in object-centric representation learning) further exploit sparsity, enabling custom backward routines based on Schur complements and sparse Cholesky factors for image-scale problems ($m \sim 50\,000$), achieving combined forward and backward solves in milliseconds (Pervez et al., 2022).

4. Alternative Splitting, Unrolling, and Learning-Based dQP Models

Recent advances demonstrate deep unrolling of operator-splitting algorithms, where each iterative step of a splitting scheme (e.g., Douglas–Rachford, PDHG) is interpreted as a neural network layer (Yang et al., 2 Dec 2024, Xiong et al., 16 Aug 2025). For example, unrolled Douglas–Rachford replaces the costly prox-linear system with a single gradient step, maintains convergence and interpretability, and achieves 40%–55% reductions in solve time versus the classical algorithm (Xiong et al., 16 Aug 2025).

Similarly, PDQP-net unrolls a primal-dual hybrid gradient method into a differentiable architecture, trained directly on KKT-residual-based losses rather than ground-truth labels—eliminating the need for expensive supervised solver-based training (Yang et al., 2 Dec 2024). Theoretical results show network parameters can be chosen so the unrolled network precisely recovers each PDQP iteration, with polynomial size and guaranteed linear convergence under convexity. Empirically, warm-starting PDQP with PDQP-net predictions produces 30%–45% acceleration in solves, with negligible overhead.
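The pattern, reduced to its essentials: each iteration becomes a layer with its own learnable parameters, and the stack is trained on a residual-based loss instead of labeled solutions. The box-constrained projected-gradient model below is a deliberately simplified stand-in for the PDHG/Douglas–Rachford unrollings above, not a reproduction of either paper's architecture:

```python
import torch
import torch.nn as nn

class UnrolledQPNet(nn.Module):
    """K projected-gradient steps for min 0.5 x'Px + q'x s.t. lb <= x <= ub,
    each step carrying its own learnable step size (a toy unrolled net)."""
    def __init__(self, K=20):
        super().__init__()
        self.log_steps = nn.Parameter(torch.zeros(K))  # one step size per layer
    def forward(self, P, q, lb, ub):
        x = torch.zeros_like(q)
        for log_t in self.log_steps:
            grad = P @ x + q                                  # objective gradient
            x = torch.clamp(x - log_t.exp() * grad, lb, ub)   # box projection
        return x

# Label-free training signal: penalize the projected-gradient residual,
# which vanishes exactly at KKT points of the box-constrained QP.
def kkt_residual(x, P, q, lb, ub):
    return torch.norm(x - torch.clamp(x - (P @ x + q), lb, ub))
```

Training against such a residual mirrors the KKT-residual losses used by PDQP-net: no solver-generated labels are needed, and the output can also serve purely as a warm start for a classical solver.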

5. Active-Set Sensitivity and Black-Box Differentiability

The dQP framework (Magoon et al., 8 Oct 2024) generalizes sensitivity analysis: rather than requiring solver internals (duals, factorizations, etc.), knowledge of the active set at the QP optimum suffices for explicit differentiation. After identification of the active constraint set $J$, the solution and all relevant sensitivities can be computed via a reduced KKT system (of size $n + p + |J|$), yielding derivatives:

$$\frac{\partial z^\star}{\partial \theta} = -K_J^{-1}\left(\frac{\partial K_J}{\partial \theta}\begin{pmatrix} z^\star \\ \lambda^\star \\ \mu_J^\star \end{pmatrix} - \frac{\partial v_J}{\partial \theta}\right)$$

This architecture enables plug-and-play differentiation for any QP solver (Gurobi, OSQP, MOSEK, PIQP, etc.), rendering the system entirely agnostic to solver internals, with 2–3 order-of-magnitude speedups versus unrolled or full-KKT implicit approaches on high-dimensional problems (Magoon et al., 8 Oct 2024).
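For intuition, here is a NumPy sketch of this reduced-system differentiation, specialized to the sensitivity $\partial x^\star/\partial q$, for which $K_J$ does not depend on the parameter and only the right-hand side is differentiated. It is an illustration of the idea, not the reference implementation; the active-set tolerance and the strict-complementarity/LICQ assumptions are made explicit:

```python
import numpy as np

def dx_dq_active_set(P, A, G, h, x_star, tol=1e-8):
    """Jacobian dx*/dq from the reduced KKT system on the active set J.
    Assumes strict complementarity and LICQ so K_J is nonsingular; needs
    only the primal solution and problem data, no solver internals."""
    J = G @ x_star >= h - tol        # identify active inequalities
    GJ = G[J]
    n, m, k = P.shape[0], A.shape[0], GJ.shape[0]
    K = np.block([
        [P,  A.T,               GJ.T],
        [A,  np.zeros((m, m)),  np.zeros((m, k))],
        [GJ, np.zeros((k, m)),  np.zeros((k, k))],
    ])
    # Reduced KKT: K [x; lambda; mu_J] = [-q; b; h_J]. Only the first block
    # of the right-hand side depends on q, so d(rhs)/dq = [-I; 0; 0].
    rhs = np.vstack([-np.eye(n), np.zeros((m + k, n))])
    return np.linalg.solve(K, rhs)[:n]   # dx*/dq, shape (n, n)
```

For general parameters $\theta$ entering $K_J$ itself (e.g., $P$ or $G$), the additional $\partial K_J/\partial\theta$ term from the displayed formula must also be assembled; the structure of the solve is unchanged.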

6. Application Domains and Empirical Performance

dQP layers have demonstrated significant impact in end-to-end learning for data-driven decision making, especially in portfolio optimization. For mean-variance portfolio construction ($n \approx 250$–$1000$ assets), ADMM-based dQP layers achieve out-of-sample Sharpe ratios (2015–2020) of $\approx 1.34$, matching OptNet ($\approx 1.28$), while achieving $5\times$ faster training at $n \approx 255$ (Butler et al., 2021, Pan et al., 28 Nov 2024). BPQP, a backward-pass QP reformulation for gradient calculation, achieves $13.5\times$ faster backward passes than CVXPY/SCS and $30\times$ faster than OptNet, while retaining gradient norm-consistency for inexact forward solves (Pan et al., 28 Nov 2024).

For object-centric representation learning via differentiable graph cuts, problem instances with up to $52\,000$ primal variables and $23\,000$ constraints are integrated efficiently, with differentiable gradients and batched GPU execution (Pervez et al., 2022).

7. Extensions, Limitations, and Best Practices

dQP layers support scaling to larger problems via exploitation of problem sparsity (in $P$, $A$), GPU batch factorization, and custom first-order solvers. The splitting paradigm extends naturally to second-order cone and semidefinite constraints by introducing additional proximal mappings (Butler et al., 2021, Healey et al., 24 Aug 2025). Adaptive penalty selection ($\rho$), residual balancing, and Anderson/Nesterov acceleration are best-practice techniques for convergence and robustness (Butler et al., 2021, Butler, 2023). Hybrid methods, in which ADMM outputs warm-start interior-point solvers when high accuracy is required ("solution polishing"), are also prevalent.
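As one example of these heuristics, a standard residual-balancing update for $\rho$ (a generic rule, not tied to any particular cited solver) can be sketched as:

```python
def residual_balance(rho, r_primal, r_dual, ratio=10.0, factor=2.0):
    """Residual-balancing heuristic: increase rho when the primal residual
    dominates (push feasibility), decrease it when the dual residual
    dominates. Any cached factorization involving rho (e.g., P + rho*C'C)
    must be recomputed after rho changes."""
    if r_primal > ratio * r_dual:
        return rho * factor
    if r_dual > ratio * r_primal:
        return rho / factor
    return rho
```

Because each $\rho$ change invalidates the cached factorization, implementations typically rebalance only every few dozen iterations.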

BPQP is compatible with any advanced QP solver for backward passes and provides KKT-norm-preserving gradients matching the forward solve residual (Pan et al., 28 Nov 2024). Active-set identification for reduced KKT sensitivity is guaranteed under strict complementarity, with regularization or soft-active-set approaches recommended otherwise (Magoon et al., 8 Oct 2024, Pan et al., 28 Nov 2024). Limitations arise in nonconvex settings; local minima can still be differentiated via local KKT systems, but global optimality is not assured.


In summary, differentiable quadratic programming constitutes an enabling technology for integrating large-scale, constrained optimization directly into differentiable computational graphs. Contemporary dQP layers prioritize first-order, splitting-based solvers and implicit fixed-point differentiation, yielding significant computational savings and scalable, robust integration across decision-making, financial, graphical, and control domains (Butler et al., 2021, Butler, 2023, Pervez et al., 2022, Magoon et al., 8 Oct 2024, Pan et al., 28 Nov 2024, Healey et al., 24 Aug 2025, Xiong et al., 16 Aug 2025, Yang et al., 2 Dec 2024).
