Complementarity-Aware Gradients

Updated 18 November 2025
  • Complementarity-aware gradients are gradient methods designed to incorporate and exploit complementarity structures in optimization problems such as LCS learning.
  • They offer closed-form expressions for gradients and Hessians, enabling efficient computation while bypassing non-smooth LCP solvers through violation-based loss functions.
  • Applications span adaptive gradient methods and composite minimization, leading to improved convergence, stability, and scalability in high-dimensional optimization scenarios.

Complementarity-aware gradients are principled gradient-based constructs designed to accommodate, exploit, or enforce complementarity structures present in optimization, machine learning, and system identification problems. In contexts such as learning Linear Complementarity Systems (LCSs), adaptive gradient methods, and convex composite minimization, complementarity-aware gradients enable efficient, stable, and theoretically well-founded optimization, even in the presence of non-smooth constraints or decomposable subspace structures.

1. Linear Complementarity Systems and Violation-Based Losses

Complementarity-aware gradients play a prominent role in the learning of LCSs, which are discrete-time dynamical systems described by

$$x_{t+1} = A x_t + B u_t + C \lambda_t + d,$$

subject to complementarity constraints of the form

$$0 \leq \lambda_t \perp w_t := D x_t + E u_t + F \lambda_t + c \geq 0,$$

or equivalently, $\lambda_t \geq 0$, $w_t \geq 0$, $\lambda_t^T w_t = 0$, with parameters $\theta = \{A, B, C, d, D, E, F, c\}$. Instead of differentiating through the non-smooth Linear Complementarity Problem (LCP) solver, which obscures gradient propagation and hampers scalability, a violation-based surrogate loss is minimized. For each data tuple, the loss functional is formulated as

$$l_\epsilon(\theta; x_t^*, u_t^*, x_{t+1}^*) = \min_{\lambda \geq 0,\, \phi \geq 0} L_\text{dyn}(\theta; x_t^*, u_t^*, \lambda) + \frac{1}{\epsilon} \left[\lambda^T \phi + \frac{1}{2\gamma} \|D x_t^* + E u_t^* + F \lambda + c - \phi \circ \phi\|^2\right],$$

where $\phi$ serves as a differentiable proxy for $w$ and $L_\text{dyn}$ is the dynamics prediction error. The total loss aggregates over the dataset. This approach ensures that the gradients with respect to $\theta$ are differentiable and efficiently computable, as the non-smooth complementarity constraints are handled through smooth penalization (Jin et al., 2021).
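
As a concrete illustration, the following is a minimal numerical sketch of evaluating this surrogate loss for a single data tuple. It is not the authors' implementation: the helper name `violation_loss`, the squared-error form of $L_\text{dyn}$, and the use of SciPy's generic bound-constrained L-BFGS-B solver in place of a structured QP solver are assumptions made purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def violation_loss(theta, x_t, u_t, x_next, eps=1e-2, gamma=1e-2):
    """Sketch: evaluate l_eps for one data tuple by minimizing the penalized
    objective over lambda >= 0, phi >= 0 with a generic bound-constrained
    solver (used here only for illustration)."""
    A, B, C, d, D, E, F, c = theta          # LCS parameters
    n_lam = C.shape[1]

    def objective(z):
        lam, phi = z[:n_lam], z[n_lam:]
        # dynamics prediction error (squared-error form assumed for L_dyn)
        e_dyn = A @ x_t + B @ u_t + C @ lam + d - x_next
        L_dyn = 0.5 * e_dyn @ e_dyn
        # complementarity violation, with phi acting as a proxy for w
        resid = D @ x_t + E @ u_t + F @ lam + c - phi * phi
        return L_dyn + (lam @ phi + 0.5 / gamma * resid @ resid) / eps

    z0 = np.ones(2 * n_lam)                  # strictly positive start
    bounds = [(0.0, None)] * (2 * n_lam)     # lambda >= 0, phi >= 0
    res = minimize(objective, z0, method="L-BFGS-B", bounds=bounds)
    lam_eps, phi_eps = res.x[:n_lam], res.x[n_lam:]
    return res.fun, lam_eps, phi_eps
```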

2. Closed-Form Gradients and Hessians for Complementarity-Violation Losses

Complementarity-aware gradients in the violation-based LCS loss admit closed-form expressions, given the unique solutions $(\lambda_t^\epsilon, \phi_t^\epsilon)$ to the inner Quadratic Program (QP) for each data point. Define

$$e_t^\text{dyn} := A x_t^* + B u_t^* + C \lambda_t^\epsilon + d - x_{t+1}^*, \qquad e_t^\text{lcp} := \frac{1}{\epsilon \gamma}\left(D x_t^* + E u_t^* + F \lambda_t^\epsilon + c - \phi_t^\epsilon \circ \phi_t^\epsilon\right).$$

The full gradient is then accumulated over data samples:

$$\begin{aligned}
\nabla_A L_\epsilon &= \sum_t e_t^\text{dyn} (x_t^*)^T, & \nabla_B L_\epsilon &= \sum_t e_t^\text{dyn} (u_t^*)^T, \\
\nabla_C L_\epsilon &= \sum_t e_t^\text{dyn} (\lambda_t^\epsilon)^T, & \nabla_d L_\epsilon &= \sum_t e_t^\text{dyn}, \\
\nabla_D L_\epsilon &= \sum_t e_t^\text{lcp} (x_t^*)^T, & \nabla_E L_\epsilon &= \sum_t e_t^\text{lcp} (u_t^*)^T, \\
\nabla_F L_\epsilon &= \sum_t e_t^\text{lcp} (\lambda_t^\epsilon)^T, & \nabla_c L_\epsilon &= \sum_t e_t^\text{lcp}.
\end{aligned}$$

Second-order derivatives are computed by implicit differentiation of the KKT conditions for the QP. The Hessian has block structure and requires only the solution of a linear system of size $2n_\lambda$ per datum for backpropagation. All matrix inversions are localized to the QP subproblems, decoupling the complexity from the ambient parameter space (Jin et al., 2021).
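
A minimal sketch of this closed-form gradient accumulation, assuming the inner solutions $(\lambda_t^\epsilon, \phi_t^\epsilon)$ have already been computed (e.g., by the routine sketched above); the function name `accumulate_gradients` and the data layout are hypothetical.

```python
import numpy as np

def accumulate_gradients(theta, data, inner_sols, eps, gamma):
    """Sketch: accumulate the closed-form gradients of L_eps over the dataset,
    given the inner solutions (lam_eps, phi_eps) for each data tuple."""
    A, B, C, d, D, E, F, c = theta
    grads = {k: np.zeros_like(v) for k, v in
             zip("ABCdDEFc", (A, B, C, d, D, E, F, c))}
    for (x_t, u_t, x_next), (lam, phi) in zip(data, inner_sols):
        # residuals e_t^dyn and e_t^lcp from the closed-form expressions
        e_dyn = A @ x_t + B @ u_t + C @ lam + d - x_next
        e_lcp = (D @ x_t + E @ u_t + F @ lam + c - phi * phi) / (eps * gamma)
        grads["A"] += np.outer(e_dyn, x_t)
        grads["B"] += np.outer(e_dyn, u_t)
        grads["C"] += np.outer(e_dyn, lam)
        grads["d"] += e_dyn
        grads["D"] += np.outer(e_lcp, x_t)
        grads["E"] += np.outer(e_lcp, u_t)
        grads["F"] += np.outer(e_lcp, lam)
        grads["c"] += e_lcp
    return grads
```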

3. Smoothing, Strong Convexity, and Algorithmic Efficiency

The introduction of the penalty parameter $\epsilon$ guarantees smoothness of $L_\epsilon(\theta)$ as a function of the LCS parameters, with the degree of smoothing controlled by $\epsilon > 0$. Strict convexity of the inner QP is ensured by setting $\gamma < \sigma_{\min}(F + F^\top)$ (the smallest eigenvalue of $F + F^\top$), so that optimization with respect to $(\lambda, \phi)$ is numerically robust and the Hessians are invertible under strict complementarity. As $\epsilon \to 0$, the original (non-smooth) LCP-based prediction loss is recovered, while for $\epsilon$ sufficiently large, gradient Lipschitz constants can be made arbitrarily small at the cost of increasing bias in the surrogate (Jin et al., 2021).

The resulting algorithm iteratively samples data mini-batches, solves the small, decoupled QPs to obtain violation proxies and adjoint variables, computes closed-form gradients, and updates θ\theta using a standard or adaptive optimizer. Empirically, this reduces compute time by orders of magnitude compared to differentiating through LCP solvers (Jin et al., 2021).
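A hedged sketch of this outer loop follows; it reuses the hypothetical helpers `violation_loss` and `accumulate_gradients` from the earlier sketches and substitutes plain mini-batch gradient descent for whichever standard or adaptive optimizer is actually used.

```python
import numpy as np

def train_lcs(theta, dataset, eps=1e-2, gamma=1e-2, lr=1e-3,
              batch_size=64, n_iters=1000, rng=np.random.default_rng(0)):
    """Sketch of the outer loop: sample a mini-batch, solve the small decoupled
    inner problems, form closed-form gradients, and take a gradient step."""
    theta = [p.copy() for p in theta]
    for _ in range(n_iters):
        idx = rng.choice(len(dataset), size=min(batch_size, len(dataset)),
                         replace=False)
        batch = [dataset[i] for i in idx]
        # one small inner problem per tuple (sketched in Section 1)
        inner = [violation_loss(theta, x, u, xn, eps, gamma)[1:]
                 for (x, u, xn) in batch]
        grads = accumulate_gradients(theta, batch, inner, eps, gamma)
        # plain SGD update on all LCS parameters
        for p, key in zip(theta, "ABCdDEFc"):
            p -= lr * grads[key]
    return theta
```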

4. Complementary Subspace Decomposition in Adaptive Gradient Methods

Complementarity in gradient-based optimization also appears in adaptive optimization methods for high-dimensional problems. CompAdaGrad (Mehta et al., 2016) partitions the parameter space $\mathbb{R}^n$ into a low-dimensional subspace (acted on by an explicit orthogonal projector $P$ associated with a random linear map $\Pi$) and its complement ($P^\perp = I - P$). Full-matrix AdaGrad regularization is performed in the low-dimensional subspace (capturing rich geometry), while inexpensive diagonal AdaGrad is used in the complementary subspace.

The update geometry is set by the block-diagonal preconditioner

$$\psi_t(x) = \frac{1}{2} \|x\|^2_{A_t^{(r)} + \tau A_t^{(c)}},$$

with $A_t^{(r)}$ the full-matrix preconditioner (restricted to the $\Pi$-subspace) and $A_t^{(c)}$ diagonal in $P^\perp$. As a result, the per-round complexity is $O(n \log k + k^3)$, and complementarity-aware subspace splitting yields improved regret bounds and test performance when the gradient covariance is concentrated in a low-rank structure. This aligns with scenarios where full-matrix statistics capture dominant directions and the remainder can be handled efficiently with diagonal scaling (Mehta et al., 2016).
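
The following is a simplified sketch of this complementary-subspace scaling, not the CompAdaGrad algorithm itself: it builds an explicit orthonormal basis `Q` from a dense Gaussian sketch (so the per-round cost is $O(nk + k^3)$ rather than the $O(n \log k + k^3)$ obtained with a structured random map), and the constants and regularization are illustrative assumptions.

```python
import numpy as np

def compadagrad_sketch(grad_fn, x0, k=20, lr=0.1, tau=1.0,
                       n_iters=100, delta=1e-8, seed=0):
    """Sketch of complementary-subspace adaptive scaling: full-matrix AdaGrad
    statistics in a random k-dimensional subspace, diagonal AdaGrad in its
    orthogonal complement.  Q Q^T plays the role of the projector P."""
    rng = np.random.default_rng(seed)
    n = x0.size
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal basis, n x k
    x = x0.copy()
    G_full = np.zeros((k, k))     # full-matrix statistics in the subspace
    g_diag = np.zeros(n)          # diagonal statistics in the complement
    for _ in range(n_iters):
        g = grad_fn(x)
        g_r = Q.T @ g                     # component in the low-dim subspace
        g_c = g - Q @ g_r                 # component in the complement
        G_full += np.outer(g_r, g_r)
        g_diag += g_c * g_c
        # inverse square root of the (regularized) full-matrix statistics
        w, V = np.linalg.eigh(G_full + delta * np.eye(k))
        step_r = V @ ((V.T @ g_r) / np.sqrt(w))
        step_c = g_c / (tau * np.sqrt(g_diag) + delta)
        x -= lr * (Q @ step_r + step_c)
    return x
```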

5. Complementary Composite Minimization in General Norms

The composite minimization paradigm provides another avenue for complementarity-aware gradient methods. Here, the objective $F(x) = f(x) + \psi(x)$ composes a weakly smooth convex $f$ (with Hölder-continuous gradient) and a uniformly convex $\psi$ (regularization or structural penalty). Complementary composite minimization refers to this decoupling, which is central for handling diverse norm geometries and regularizers simultaneously.

Accelerated algorithms (generalized AGD$^+$) maintain several iterates (extrapolated, proximal, and averaged), and their convergence relies on precise choice of learning rates tied to the smoothness and convexity constants of $f$ and $\psi$. Notably, the framework addresses minimization of the gradient norm in dual spaces (e.g., enforcing $\|\nabla f(x)\|_{p^*} \leq \epsilon$), with rates matching known lower bounds up to logarithmic terms. This is particularly effective for regression, bridge and $\ell_p$ minimization, low-rank spectral estimation, and entropy-regularized optimal transport (Diakonikolas et al., 2021).
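
For intuition only, the sketch below illustrates the composite split $F(x) = f(x) + \psi(x)$ with a plain (non-accelerated) proximal gradient step and the simplest uniformly convex choice $\psi(x) = \tfrac{\mu}{2}\|x\|_2^2$, whose proximal operator is available in closed form; the actual AGD$^+$ scheme of Diakonikolas et al. (2021) maintains extrapolated, proximal, and averaged iterates and handles general norms.

```python
import numpy as np

def proximal_gradient_sketch(grad_f, x0, mu=1e-2, lr=0.1, n_iters=200):
    """Sketch of the composite split F(x) = f(x) + psi(x): a gradient step on
    the smooth part f, followed by the prox of psi(x) = (mu/2)||x||_2^2."""
    x = x0.copy()
    for _ in range(n_iters):
        y = x - lr * grad_f(x)      # forward (gradient) step on f
        # prox step on psi: argmin_z (1/(2*lr))||z - y||^2 + (mu/2)||z||^2
        # first-order condition (z - y)/lr + mu*z = 0  =>  z = y / (1 + lr*mu)
        x = y / (1.0 + lr * mu)
    return x
```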

6. Theoretical Guarantees and Practical Implications

Complementarity-aware gradient approaches deliver several key theoretical properties:

  • Differentiability: Under strict complementarity and proper smoothing, violation-based LCS losses are $C^1$ in the parameter space (Jin et al., 2021).
  • Consistency and Recovery: As the loss penalty parameter $\epsilon \to 0$, minimizers of the surrogate converge to those of the classical LCP-based loss with hard complementarity enforcement (Jin et al., 2021).
  • Regret and Convergence Rates: CompAdaGrad attains regret bounds interpolating between full and diagonal AdaGrad, with rates controlled by the trace of the restricted covariance and the sparsity of the complement (Mehta et al., 2016). Complementary composite minimization achieves (near-)optimal iteration complexity in normed spaces, unifying composite regularization and accelerated convergence frameworks (Diakonikolas et al., 2021).
  • Lipschitz and Stability Control: The Lipschitz constant of the gradient of the violation-based LCS loss can be explicitly controlled via $\epsilon$; in composite settings, stability and generalization follow from regularization (Jin et al., 2021, Diakonikolas et al., 2021).

7. Applications and Comparative Analysis

Complementarity-aware gradients are used to:

  • Learn LCS models with tens of thousands of stiff hybrid modes efficiently, without differentiating through LCP solvers (Jin et al., 2021).
  • Train large-scale models with adaptive preconditioning, exploiting low-dimensional geometric regularities for improved test performance, as in kernelized MNIST and other high-dimensional feature settings (Mehta et al., 2016).
  • Optimize under general norm and structure constraints (elastic-net, bridge regression, spectral problems), leveraging the decoupled composite minimization framework to attain optimal rates and practical implementability under a variety of norms (Diakonikolas et al., 2021).

Empirically, the efficacy of complementarity-aware approaches is most pronounced when complementarity or low-dimensional subspace structures are aligned with problem data or regularization desiderata. When gradients lack structure or the feature matrix is highly sparse, the additional structure may provide negligible advantage or introduce bias (Mehta et al., 2016).

In summary, complementarity-aware gradients form a methodological backbone for contemporary approaches in learning hybrid dynamical systems, scalable adaptive optimization, and regularized convex minimization, providing theoretical guarantees, implementation tractability, and broad applicability in data-driven contexts (Jin et al., 2021, Mehta et al., 2016, Diakonikolas et al., 2021).
