Complementarity-Aware Gradients
- Complementarity-aware gradients are gradient methods designed to incorporate and exploit complementarity structures in optimization problems such as linear complementarity system (LCS) learning.
- They offer closed-form expressions for gradients and Hessians, enabling efficient computation while bypassing non-smooth LCP solvers through violation-based loss functions.
- Applications span adaptive gradient methods and composite minimization, leading to improved convergence, stability, and scalability in high-dimensional optimization scenarios.
Complementarity-aware gradients are principled gradient-based constructs designed to accommodate, exploit, or enforce complementarity structures present in optimization, machine learning, and system identification problems. In contexts such as learning Linear Complementarity Systems (LCSs), adaptive gradient methods, and convex composite minimization, complementarity-aware gradients enable efficient, stable, and theoretically well-founded optimization, even in the presence of non-smooth constraints or decomposable subspace structures.
1. Linear Complementarity Systems and Violation-Based Losses
Complementarity-aware gradients play a prominent role in the learning of LCSs, which are discrete-time dynamical systems described by
$$x_{t+1} = A x_t + B u_t + C \lambda_t + d,$$
subject to complementarity constraints of the form
$$0 \le \lambda_t \;\perp\; D x_t + E u_t + F \lambda_t + c \ge 0,$$
or equivalently, $\lambda_t \ge 0$, $D x_t + E u_t + F \lambda_t + c \ge 0$, $\lambda_t^\top (D x_t + E u_t + F \lambda_t + c) = 0$, with parameters $\theta = (A, B, C, d, D, E, F, c)$. Instead of differentiating through the non-smooth Linear Complementarity Problem (LCP) solver, which obscures gradient propagation and hampers scalability, a violation-based surrogate loss is minimized. For each data tuple $(x_t, u_t, x_{t+1})$, the loss is formulated as
$$\ell^{\gamma}_{\theta}(x_t, u_t, x_{t+1}) \;=\; \min_{\substack{\lambda \ge 0 \\ D x_t + E u_t + F\lambda + c \,\ge\, 0}} \;\big\|x_{t+1} - (A x_t + B u_t + C\lambda + d)\big\|^2 \;+\; \frac{1}{\gamma}\,\lambda^\top\big(D x_t + E u_t + F\lambda + c\big),$$
where the bilinear term $\lambda^\top(D x_t + E u_t + F\lambda + c)$ serves as a differentiable proxy for the complementarity condition and the first term is the dynamics prediction error. The total loss $L(\theta) = \sum_t \ell^{\gamma}_{\theta}(x_t, u_t, x_{t+1})$ aggregates over the dataset. This approach makes the loss differentiable in $\theta$ with efficiently computable gradients, as the non-smooth complementarity constraints are handled through smooth penalization (Jin et al., 2021).
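As a concrete illustration, the Python sketch below evaluates this per-tuple surrogate by solving the small inner problem over $\lambda$ with a generic constrained solver. The function name, the use of SciPy's SLSQP routine, and the exact weighting of the two terms are expository assumptions, not the reference implementation of Jin et al. (2021).

```python
import numpy as np
from scipy.optimize import minimize

def violation_loss(theta, x, u, x_next, gamma=1e-2):
    """Per-tuple violation-based surrogate loss for an LCS (illustrative sketch).

    theta collects the LCS parameters (A, B, C, d, D, E, F, c); the inner
    problem over lambda replaces a call to a non-smooth LCP solver.
    """
    A, B, C, d, D, E, F, c = theta
    n_lam = F.shape[0]

    def objective(lam):
        gap = D @ x + E @ u + F @ lam + c                 # complementarity residual
        pred = x_next - (A @ x + B @ u + C @ lam + d)     # dynamics prediction error
        return pred @ pred + (1.0 / gamma) * (lam @ gap)  # error + scaled violation proxy

    res = minimize(
        objective,
        x0=np.ones(n_lam),
        method="SLSQP",
        bounds=[(0.0, None)] * n_lam,                     # lambda >= 0
        constraints=[{"type": "ineq",
                      "fun": lambda lam: D @ x + E @ u + F @ lam + c}],
    )
    return res.fun, res.x                                 # loss value and lambda*
```

In practice the strictly convex inner QP would be handed to a dedicated QP solver; the generic call above only keeps the sketch dependency-free.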
2. Closed-Form Gradients and Hessians for Complementarity-Violation Losses
Complementarity-aware gradients of the violation-based LCS loss admit closed-form expressions once the inner Quadratic Program (QP) for each data point has been solved to its unique primal–dual solution $(\lambda_t^\star, \nu_t^\star)$: by the envelope theorem,
$$\nabla_\theta\, \ell^{\gamma}_{\theta}(x_t, u_t, x_{t+1}) \;=\; \partial_\theta\, \mathcal{L}\big(\lambda, \nu; \theta\big)\,\Big|_{(\lambda,\nu) = (\lambda_t^\star,\, \nu_t^\star)},$$
where $\mathcal{L}$ denotes the Lagrangian of the inner QP and the primal–dual variables are held fixed during differentiation. The full gradient is then accumulated over data samples:
$$\nabla_\theta L(\theta) \;=\; \sum_t \nabla_\theta\, \ell^{\gamma}_{\theta}(x_t, u_t, x_{t+1}).$$
Second-order derivatives are computed by implicit differentiation of the KKT conditions of the QP. The Hessian has block structure and requires, per datum, only the solution of a linear system the size of that datum's KKT system for backpropagation. All matrix inversions are localized to the QP subproblems, decoupling the complexity from the ambient parameter space (Jin et al., 2021).
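Under the surrogate form used above, and assuming access to the multiplier $\nu^\star$ of the constraint $D x_t + E u_t + F\lambda + c \ge 0$ alongside $\lambda^\star$, the envelope-theorem gradient can be written out block by block. The following sketch uses that assumed loss form and notation; it is not the paper's implementation.

```python
import numpy as np

def loss_gradients(theta, x, u, x_next, lam_star, nu_star, gamma=1e-2):
    """Envelope-theorem gradients of the per-tuple violation loss (sketch).

    lam_star: optimal lambda of the inner QP; nu_star: multiplier of the
    constraint D x + E u + F lam + c >= 0.  Both are held fixed while the
    Lagrangian is differentiated with respect to the parameters.
    """
    A, B, C, d, D, E, F, c = theta
    r = x_next - (A @ x + B @ u + C @ lam_star + d)   # prediction residual
    w = lam_star / gamma - nu_star                    # weight on the constraint rows

    return {
        # dynamics block: derivative of ||r||^2 with lam_star held fixed
        "A": -2.0 * np.outer(r, x),
        "B": -2.0 * np.outer(r, u),
        "C": -2.0 * np.outer(r, lam_star),
        "d": -2.0 * r,
        # complementarity block: violation penalty minus the constraint's
        # contribution through its multiplier
        "D": np.outer(w, x),
        "E": np.outer(w, u),
        "F": np.outer(w, lam_star),
        "c": w,
    }
```

Summing these per-tuple dictionaries reproduces the accumulated gradient $\nabla_\theta L(\theta)$ above.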
3. Smoothing, Strong Convexity, and Algorithmic Efficiency
The introduction of the penalty parameter $\gamma$ guarantees smoothness of $\ell^{\gamma}_{\theta}$ as a function of the LCS parameters, with the degree of smoothing controlled by $\gamma$. Strict convexity of the inner QP is ensured by a condition tied to the minimal eigenvalue of its quadratic term, so that optimization over $\lambda$ is numerically robust and Hessians are invertible under strict complementarity. As $\gamma \to 0$, the surrogate recovers the original (non-smooth) LCP-based prediction loss, while for $\gamma$ sufficiently large, the gradient Lipschitz constant can be made arbitrarily small at the cost of increasing bias in the surrogate (Jin et al., 2021).
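Schematically, in the notation of the surrogate above, the recovery property reads
$$\lim_{\gamma \to 0^{+}} \ell^{\gamma}_{\theta}(x_t, u_t, x_{t+1}) \;=\; \min_{\lambda \,\in\, \mathrm{SOL}(\theta;\, x_t, u_t)} \big\| x_{t+1} - (A x_t + B u_t + C \lambda + d) \big\|^{2},$$
where $\mathrm{SOL}(\theta; x_t, u_t)$ denotes the solution set of the LCP defined by the complementarity constraints, i.e., the hard prediction loss obtained by first solving the LCP and then measuring the dynamics error.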
The resulting algorithm iteratively samples data mini-batches, solves the small, decoupled QPs to obtain violation proxies and adjoint variables, computes closed-form gradients, and updates $\theta$ using a standard or adaptive optimizer. Empirically, this reduces compute time by orders of magnitude compared to differentiating through LCP solvers (Jin et al., 2021).
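A minimal version of that loop, reusing the hypothetical `loss_gradients` helper above and an assumed `solve_inner_qp` routine that returns the primal–dual solution of the inner QP, might look as follows (plain gradient steps stand in for an adaptive optimizer).

```python
import numpy as np

def train_lcs(theta, data, gamma=1e-2, lr=1e-3, epochs=100, batch_size=32, seed=0):
    """Mini-batch learning of LCS parameters via the violation-based loss (sketch)."""
    rng = np.random.default_rng(seed)
    names = ("A", "B", "C", "d", "D", "E", "F", "c")
    params = dict(zip(names, theta))
    for _ in range(epochs):
        batch = rng.choice(len(data), size=batch_size, replace=False)
        grads = {k: np.zeros_like(v) for k, v in params.items()}
        for i in batch:
            x, u, x_next = data[i]
            # assumed helper: solves the small decoupled QP for this tuple and
            # returns the violation proxy plus primal-dual (adjoint) variables
            _, lam_star, nu_star = solve_inner_qp(tuple(params.values()), x, u, x_next, gamma)
            g = loss_gradients(tuple(params.values()), x, u, x_next, lam_star, nu_star, gamma)
            for k in grads:
                grads[k] += g[k]
        for k in params:                     # plain gradient step; any optimizer works here
            params[k] -= lr * grads[k] / batch_size
    return tuple(params[k] for k in names)
```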
4. Complementary Subspace Decomposition in Adaptive Gradient Methods
Complementarity in gradient-based optimization also appears in adaptive methods for high-dimensional problems. CompAdaGrad (Mehta et al., 2016) partitions the parameter space into a low-dimensional subspace, acted on by an explicit orthogonal projector associated with a random linear map, and its orthogonal complement. Full-matrix AdaGrad regularization is performed in the low-dimensional subspace (capturing rich geometry), while inexpensive diagonal AdaGrad is used in the complementary subspace.
The update geometry is set by a block-diagonal preconditioner: a full-matrix (AdaGrad-style) preconditioner restricted to the low-dimensional subspace, and a diagonal preconditioner on its complement. As a result, the per-round overhead relative to diagonal AdaGrad is governed by the small subspace dimension rather than the ambient one, and complementarity-aware subspace splitting yields improved regret bounds and test performance when the gradient covariance is concentrated in a low-rank structure. This aligns with scenarios where full-matrix statistics capture dominant directions and the remainder can be handled efficiently with diagonal scaling (Mehta et al., 2016).
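A single preconditioned step of this kind can be sketched as follows, with full-matrix statistics kept in a $k$-dimensional subspace spanned by the orthonormal columns of a matrix $S$ and diagonal statistics on the complement; the variable names and the plain projector construction are illustrative, not CompAdaGrad's exact formulation.

```python
import numpy as np

def split_adagrad_step(w, grad, S, G_full, h_diag, lr=0.1, eps=1e-8):
    """One complementarity-aware preconditioned step (illustrative sketch).

    S       : (d, k) matrix with orthonormal columns spanning the low-dim subspace.
    G_full  : (k, k) accumulated outer products of projected gradients.
    h_diag  : (d,)   accumulated squared gradients in the complementary subspace.
    """
    g_low = S.T @ grad                      # coordinates in the k-dimensional subspace
    g_comp = grad - S @ g_low               # component in the orthogonal complement

    # full-matrix AdaGrad statistics in the small subspace
    G_full += np.outer(g_low, g_low)
    evals, evecs = np.linalg.eigh(G_full)
    G_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    step_low = S @ (G_inv_sqrt @ g_low)

    # diagonal AdaGrad statistics on the complement
    h_diag += g_comp ** 2
    step_comp = g_comp / (np.sqrt(h_diag) + eps)

    return w - lr * (step_low + step_comp), G_full, h_diag
```

Per step, the projections cost $O(dk)$ and the eigendecomposition $O(k^3)$, so the overhead beyond diagonal AdaGrad is controlled by the subspace dimension $k$.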
5. Complementary Composite Minimization in General Norms
The composite minimization paradigm provides another avenue for complementarity-aware gradient methods. Here, the objective $f + \psi$ composes a weakly smooth convex term $f$ (with Hölder-continuous gradients) and a uniformly convex term $\psi$ (a regularizer or structural penalty). Complementary composite minimization refers to this decoupling, which is central for handling diverse norm geometries and regularizers simultaneously.
Accelerated algorithms (generalized AGD) maintain several iterates (extrapolated, proximal, and averaged), and their convergence relies on a precise choice of learning rates tied to the smoothness and convexity constants of $f$ and $\psi$. Notably, the framework addresses minimization of the gradient norm in dual spaces (e.g., driving $\|\nabla f(x)\|_*$ below a target accuracy), with rates matching known lower bounds up to logarithmic terms. This is particularly effective for bridge ($\ell_p$-penalized) regression, low-rank spectral estimation, and entropy-regularized optimal transport (Diakonikolas et al., 2021).
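As a minimal Euclidean special case of this paradigm (not the general-norm algorithm of Diakonikolas et al., 2021), the sketch below runs an accelerated proximal gradient method on $f + \psi$ with a smooth $f$ and a strongly convex quadratic $\psi(x) = \tfrac{\sigma}{2}\|x\|^2$, whose proximal step has a closed form.

```python
import numpy as np

def accelerated_composite(grad_f, L, sigma, x0, iters=500):
    """Accelerated proximal gradient for min_x f(x) + (sigma/2)||x||^2 (sketch).

    grad_f : gradient oracle of the smooth part f (Lipschitz constant L).
    sigma  : strong-convexity modulus of the quadratic regularizer psi.
    """
    x = y = np.asarray(x0, dtype=float)
    t = 1.0
    eta = 1.0 / L                                   # step size tied to the smoothness of f
    for _ in range(iters):
        # forward step on f, then the exact prox of psi
        z = y - eta * grad_f(y)
        x_new = z / (1.0 + eta * sigma)
        # Nesterov-style extrapolation
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

For ridge-regularized least squares, for instance, `grad_f` would be `lambda x: A.T @ (A @ x - b)` with `L` the largest eigenvalue of `A.T @ A`.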
6. Theoretical Guarantees and Practical Implications
Complementarity-aware gradient approaches deliver several key theoretical properties:
- Differentiability: Under strict complementarity and proper smoothing, violation-based LCS losses are continuously differentiable in the parameter space, with well-defined Hessians (Jin et al., 2021).
- Consistency and Recovery: As the loss penalty parameter $\gamma \to 0$, minimizers of the surrogate converge to those of the classical LCP-based loss with hard complementarity enforcement (Jin et al., 2021).
- Regret and Convergence Rates: CompAdaGrad attains regret bounds interpolating between full and diagonal AdaGrad, with rates controlled by the trace of the restricted covariance and the sparsity of the complement (Mehta et al., 2016). Complementary composite minimization achieves (near-)optimal iteration complexity in normed spaces, unifying composite regularization and accelerated convergence frameworks (Diakonikolas et al., 2021).
- Lipschitz and Stability Control: The Lipschitz constant of the gradient of the violation-based LCS loss can be explicitly controlled via $\gamma$; in composite settings, stability and generalization follow from the uniformly convex regularization (Jin et al., 2021, Diakonikolas et al., 2021).
7. Applications and Comparative Analysis
Complementarity-aware gradients are used to:
- Learn LCS models with tens of thousands of stiff hybrid modes efficiently, without differentiating through LCP solvers (Jin et al., 2021).
- Train large-scale models with adaptive preconditioning, exploiting low-dimensional geometric regularities for improved test performance, as in kernelized MNIST and other high-dimensional feature settings (Mehta et al., 2016).
- Optimize under general norm and structure constraints (elastic-net, bridge regression, spectral problems), leveraging the decoupled composite minimization framework to attain optimal rates and practical implementability under a variety of norms (Diakonikolas et al., 2021).
Empirically, the efficacy of complementarity-aware approaches is most pronounced when complementarity or low-dimensional subspace structures are aligned with problem data or regularization desiderata. When gradients lack structure or the feature matrix is highly sparse, the additional structure may provide negligible advantage or introduce bias (Mehta et al., 2016).
In summary, complementarity-aware gradients form a methodological backbone for contemporary approaches in learning hybrid dynamical systems, scalable adaptive optimization, and regularized convex minimization, providing theoretical guarantees, implementation tractability, and broad applicability in data-driven contexts (Jin et al., 2021, Mehta et al., 2016, Diakonikolas et al., 2021).