Cooper: A Library for Constrained Optimization in Deep Learning (2504.01212v1)

Published 1 Apr 2025 in cs.LG and cs.MS

Abstract: Cooper is an open-source package for solving constrained optimization problems involving deep learning models. Cooper implements several Lagrangian-based first-order update schemes, making it easy to combine constrained optimization algorithms with high-level features of PyTorch such as automatic differentiation, and specialized deep learning architectures and optimizers. Although Cooper is specifically designed for deep learning applications where gradients are estimated based on mini-batches, it is suitable for general non-convex continuous constrained optimization. Cooper's source code is available at https://github.com/cooper-org/cooper.

Summary

The paper introduces a modular open-source library that leverages Lagrangian-based first-order methods to address constrained optimization in deep learning models.
It integrates seamlessly with PyTorch by utilizing automatic differentiation and mini-batch processing to update both primal parameters and dual Lagrange multipliers.
Its design enables practical applications in fairness, robustness, and physics-informed neural networks, broadening the toolkit for structured deep learning tasks.

Cooper is an open-source Python library designed to facilitate the solution of constrained optimization problems within the PyTorch deep learning framework (2504.01212). It specifically targets scenarios where deep neural networks form part of the objective function or the constraints. The library implements several Lagrangian-based first-order optimization algorithms, aiming to seamlessly integrate constrained optimization techniques with standard deep learning workflows, including automatic differentiation, mini-batch processing, and the use of various network architectures and optimizers.

Problem Formulation and Lagrangian Methods

Cooper addresses optimization problems typically formulated as:

$\begin{aligned} \min_{w \in \mathcal{W}} \quad & f(w) \\ \text{subject to} \quad & g_i(w) \le 0, \quad i = 1, \dots, m \\ & h_j(w) = 0, \quad j = 1, \dots, k \end{aligned}$

Here, $w$ represents the parameters of a deep learning model (or other variables), $f(w)$ is the objective function, $g_i(w)$ are inequality constraints, and $h_j(w)$ are equality constraints. In deep learning contexts, $f$, $g_i$, and $h_j$ are often non-convex functions involving neural network computations, and gradients are typically estimated using mini-batches of data.

Cooper leverages the Lagrangian formulation to handle these constraints. The associated Lagrangian function is:

$L(w, \lambda, \mu) = f(w) + \sum_{i=1}^m \lambda_i g_i(w) + \sum_{j=1}^k \mu_j h_j(w)$

where $\lambda = (\lambda_1, \dots, \lambda_m)$ are the Lagrange multipliers (dual variables) for the inequality constraints ($\lambda_i \ge 0$), and $\mu = (\mu_1, \dots, \mu_k)$ are the multipliers for the equality constraints.
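As a minimal sketch of the formula above, the Lagrangian can be evaluated from precomputed objective and constraint values with plain PyTorch. The helper name `lagrangian` and the toy numbers are illustrative assumptions, not part of Cooper's API.

```python
import torch

def lagrangian(f_val, g_vals, h_vals, lam, mu):
    """Evaluate L(w, lambda, mu) = f + lambda . g + mu . h from precomputed values."""
    return f_val + torch.dot(lam, g_vals) + torch.dot(mu, h_vals)

# Toy values: one objective, two inequality constraints, one equality constraint.
f_val = torch.tensor(3.0)
g_vals = torch.tensor([0.5, -1.0])   # g_1 is violated (> 0), g_2 is satisfied
h_vals = torch.tensor([0.2])
lam = torch.tensor([2.0, 0.0])       # multipliers for g must be non-negative
mu = torch.tensor([-1.0])            # multipliers for h may take either sign

value = lagrangian(f_val, g_vals, h_vals, lam, mu)
```

Only the violated inequality carries a positive multiplier in this toy setting, so it is the only inequality that contributes a penalty to the Lagrangian.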

The original constrained problem can be recast as a saddle-point (min-max) problem in which the only remaining constraint is the non-negativity of $\lambda$:

$\min_w \max_{\lambda \ge 0, \mu} L(w, \lambda, \mu)$
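To see why the min-max problem recovers the constrained one, it may help to evaluate the inner maximization for a fixed $w$. This is the standard argument, stated here as a sketch using the document's symbols:

$\max_{\lambda \ge 0,\, \mu} L(w, \lambda, \mu) = \begin{cases} f(w) & \text{if } g_i(w) \le 0 \text{ for all } i \text{ and } h_j(w) = 0 \text{ for all } j, \\ +\infty & \text{otherwise.} \end{cases}$

If some $g_i(w) > 0$, letting $\lambda_i \to \infty$ makes the Lagrangian unbounded; if some $h_j(w) \ne 0$, letting $\mu_j \to \pm\infty$ with the sign of $h_j(w)$ does the same. When $w$ is feasible, every term $\lambda_i g_i(w)$ is non-positive and every term $\mu_j h_j(w)$ is zero, so the maximum is attained at $\lambda = 0$ and equals $f(w)$. The outer minimization therefore only ever selects feasible points.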

Cooper implements first-order methods to find approximate solutions to this saddle-point problem. These methods update the primal variables ($w$) by descent and the dual variables ($\lambda, \mu$) by ascent, either simultaneously or in alternation. For instance, a common simultaneous scheme combines gradient descent on $w$ with projected gradient ascent on $\lambda$ and plain gradient ascent on $\mu$, all evaluated at the current iterate:

$w_{t+1} = w_t - \eta_p \nabla_w L(w_t, \lambda_t, \mu_t)$

$\lambda_{t+1} = P_{\ge 0} (\lambda_t + \eta_d \nabla_\lambda L(w_t, \lambda_t, \mu_t)) = P_{\ge 0} (\lambda_t + \eta_d g(w_t))$

$\mu_{t+1} = \mu_t + \eta_d \nabla_\mu L(w_t, \lambda_t, \mu_t) = \mu_t + \eta_d h(w_t)$

where $\eta_p$ and $\eta_d$ are primal and dual learning rates, $P_{\ge 0}$ denotes projection onto the non-negative orthant, $g(w) = (g_1(w), \dots, g_m(w))$, and $h(w) = (h_1(w), \dots, h_k(w))$. Cooper implements variations on this scheme (e.g., augmented Lagrangian formulations or different update rules) that remain compatible with stochastic gradients computed from mini-batches.
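The update rules above can be sketched in a few lines of plain PyTorch on a one-dimensional toy problem: minimize $w^2$ subject to $1 - w \le 0$. The problem, the function name `gradient_descent_ascent`, and the step sizes are assumptions chosen for illustration; this is not Cooper's implementation. The KKT conditions give the solution $w^\star = 1$, $\lambda^\star = 2$.

```python
import torch

def gradient_descent_ascent(steps=500, eta_p=0.1, eta_d=0.1):
    """Simultaneous GDA on: min_w w^2  subject to  g(w) = 1 - w <= 0."""
    w = torch.tensor(0.0, requires_grad=True)
    lam = torch.tensor(0.0, requires_grad=True)
    for _ in range(steps):
        # Lagrangian at the current iterate (w_t, lam_t).
        lagr = w ** 2 + lam * (1.0 - w)
        grad_w, grad_lam = torch.autograd.grad(lagr, [w, lam])
        with torch.no_grad():
            w -= eta_p * grad_w        # primal descent step
            lam += eta_d * grad_lam    # dual ascent step; grad_lam equals g(w_t)
            lam.clamp_(min=0.0)        # projection onto lam >= 0
    return w.item(), lam.item()

w_final, lam_final = gradient_descent_ascent()
```

Both gradients are computed before either variable is modified, which is what makes the scheme simultaneous rather than alternating.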

Integration with PyTorch and Implementation Structure

A key design principle of Cooper is its tight integration with PyTorch. It allows users to define the objective $f(w)$ and constraints $g_i(w), h_j(w)$ using standard PyTorch modules and operations. Cooper then leverages PyTorch's automatic differentiation (autograd) engine to compute the necessary gradients ($\nabla_w L$, $\nabla_\lambda L$, $\nabla_\mu L$) efficiently.
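The role of autograd can be illustrated without Cooper: a single backward pass through a hand-built Lagrangian yields both the primal gradient and the dual gradient, and the latter is just the vector of constraint values. The objective, constraints, and numbers below are assumptions made for this sketch.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)      # primal variables
lam = torch.tensor([0.5, 1.5], requires_grad=True)    # inequality multipliers

f = (w ** 2).sum()                                    # objective f(w)
g = torch.stack([w[0] + w[1] - 1.0, w[0] ** 2 - 2.0]) # constraints g(w) <= 0
lagr = f + torch.dot(lam, g)
lagr.backward()

# After backward():
#   lam.grad holds dL/dlam = g(w)
#   w.grad   holds dL/dw   = grad f(w) + J_g(w)^T lam
```

Here `lam.grad` is `[2.0, -1.0]`, matching $g(w)$, and `w.grad` is `[5.5, 4.5]`, matching $2w + \lambda_1 (1, 1) + \lambda_2 (2 w_1, 0)$.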

The typical workflow involves these steps:

Define the Problem: Specify the objective function and the constraint functions (both inequality and equality). These are typically functions that take the model parameters (or model output) as input and return scalar values representing the loss or constraint violation levels.
Instantiate ConstrainedMinimizationProblem: Create an instance of Cooper's ConstrainedMinimizationProblem class. This object encapsulates the objective and constraint definitions.
Define Primal and Dual Optimizers: Select standard PyTorch optimizers (e.g., torch.optim.Adam, torch.optim.SGD) for updating the primal parameters ( $w$ ) and Cooper's internal Lagrange multipliers ( $\lambda, \mu$ ). Cooper manages the dual variables and their updates.
Instantiate ConstrainedOptimizer: Wrap the primal optimizer and the ConstrainedMinimizationProblem object within Cooper's ConstrainedOptimizer. This optimizer orchestrates the alternating primal-dual updates.
Optimization Loop: Implement the training loop. In each iteration:
- Compute the objective and constraint values based on a mini-batch.
- Clear stale gradients with ConstrainedOptimizer.zero_grad(), then call lagrangian.backward() to compute the gradients. Cooper computes the Lagrangian internally based on the current multiplier values.
- Call ConstrainedOptimizer.step() to update both the primal model parameters and the dual Lagrange multipliers according to the chosen Lagrangian-based scheme.

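Below is a conceptual code structure illustrating this workflow. The class names and arguments are schematic; exact signatures depend on the Cooper version, so consult the library documentation before use: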

```python
import torch
import cooper

model = ... # Your torch.nn.Module
objective_fn = ... # Computes f(w) based on model output and batch data
constraint_fns = [...] # List of functions computing g_i(w) or h_j(w)

cmp = cooper.ConstrainedMinimizationProblem(is_constrained=True)

primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

cooper_optimizer = cooper.ConstrainedOptimizer(
    primal_optimizer=primal_optimizer,
    dual_restarts=False, # Option for handling dual variables
    problem=cmp
)

for inputs, targets in dataloader:
    # Compute objective and constraints
    lagrangian = cmp.compute_lagrangian(
        closure=lambda: objective_fn(model(inputs), targets),
        constraints=constraint_fns # Cooper evaluates these internally
    )

    # Perform optimization step (updates primal and dual variables)
    cooper_optimizer.zero_grad()
    lagrangian.backward()
    cooper_optimizer.step()
```
This structure allows users to leverage familiar PyTorch components while Cooper handles the complexities of the constrained optimization updates, including managing the Lagrange multipliers and keeping the multipliers of inequality constraints non-negative. Its design explicitly supports mini-batch gradient estimates, making it suitable for large-scale deep learning tasks.
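For intuition about what such a wrapper automates, the same mini-batch primal-dual loop can be written by hand with two standard PyTorch optimizers. This is a sketch that deliberately avoids Cooper's API; the synthetic data, the norm-budget constraint $\|w\|^2 - 1 \le 0$, and all hyperparameters are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data; the unconstrained optimum has squared norm 5.25.
X = torch.randn(256, 3)
y = X @ torch.tensor([2.0, -1.0, 0.5])
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(3, 1, bias=False)
lam = torch.zeros(1, requires_grad=True)   # multiplier for the single inequality

primal_optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dual_optimizer = torch.optim.SGD([lam], lr=0.01)

for epoch in range(250):
    for inputs, targets in loader:
        # Mini-batch estimate of the objective f(w).
        loss = torch.nn.functional.mse_loss(model(inputs).squeeze(-1), targets)
        violation = model.weight.pow(2).sum() - 1.0   # g(w) = ||w||^2 - 1 <= 0
        lagrangian = loss + (lam * violation).sum()

        primal_optimizer.zero_grad()
        dual_optimizer.zero_grad()
        lagrangian.backward()
        lam.grad.neg_()            # flip the sign so the SGD step ascends in lam
        primal_optimizer.step()
        dual_optimizer.step()
        with torch.no_grad():
            lam.clamp_(min=0.0)    # projection onto lam >= 0

final_sq_norm = model.weight.detach().pow(2).sum().item()
```

Because the unconstrained least-squares solution violates the budget, the multiplier should settle at a positive value and the squared weight norm should hover near 1, with some residual fluctuation from mini-batch noise.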

Applicability and Use Cases

Cooper's primary domain is deep learning, but its underlying framework is applicable to general non-convex continuous constrained optimization problems where first-order methods are appropriate. Within deep learning, potential applications include:

Fairness Constraints: Enforcing fairness criteria (e.g., demographic parity, equalized odds) by formulating them as constraints on the model's predictions across different demographic groups.
Robustness: Improving model robustness, for example, by constraining the output variation under adversarial perturbations or ensuring stability properties.
Physics-Informed Neural Networks (PINNs): Incorporating physical laws (often expressed as differential equations) as equality or inequality constraints that the neural network output must satisfy.
Structured Regularization: Imposing structural constraints on model parameters or activations, such as sparsity, low-rankness, or specific norms, beyond standard weight decay.
Resource Constraints: Optimizing models under budget constraints related to computational cost, latency, or memory footprint during inference.
Safe Reinforcement Learning: Enforcing safety constraints during policy optimization in reinforcement learning settings.
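As one concrete instance of the use cases above, a demographic parity requirement can be written as a differentiable inequality constraint $g(w) \le 0$ on the model's predicted probabilities. The function name `demographic_parity_violation`, the tolerance, and the toy tensors are hypothetical and serve only to illustrate the shape of such a constraint.

```python
import torch

def demographic_parity_violation(probs, group, tolerance=0.05):
    """g(w) = |mean prediction in group 1 - mean prediction in group 0| - tolerance."""
    rate_1 = probs[group == 1].mean()
    rate_0 = probs[group == 0].mean()
    return (rate_1 - rate_0).abs() - tolerance

probs = torch.tensor([0.9, 0.7, 0.2, 0.4])   # hypothetical model outputs
group = torch.tensor([1, 1, 0, 0])           # hypothetical group labels
violation = demographic_parity_violation(probs, group)
```

A positive return value means the gap between groups exceeds the tolerance. Since the value is differentiable with respect to `probs`, it can enter a Lagrangian as one of the $g_i$ terms and be estimated per mini-batch, provided each batch contains samples from both groups.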

The library's focus on Lagrangian methods provides a principled way to handle these constraints, offering a trade-off between objective minimization and constraint satisfaction controlled by the Lagrange multipliers, which are learned during optimization.

Conclusion

Cooper provides a valuable tool for practitioners seeking to incorporate complex constraints into deep learning models using PyTorch (2504.01212). By implementing Lagrangian-based first-order methods and integrating tightly with PyTorch's ecosystem, it simplifies the application of constrained optimization techniques to challenging non-convex problems prevalent in the field. Its suitability for mini-batch settings and general applicability make it a potentially useful component in developing fair, robust, physically consistent, or otherwise structured machine learning systems. The source code is available at https://github.com/cooper-org/cooper, which facilitates its adoption and extension by the research community.