Jacobian Descent: Optimizing Vector Objectives

Updated 10 April 2026

Jacobian Descent (JD) is a framework for optimizing vector-valued objectives by treating each component as an independent target and leveraging Jacobian structures.
JD employs aggregator functions like UpGrad to create non-conflicting descent directions, ensuring principled control and efficient scaling via SVD or Gramian approximations.
JD unifies natural gradient and multi-objective optimization methods, enhancing convergence in both over- and under-parameterized regimes.

Jacobian Descent (JD) is a general framework for optimizing vector-valued objectives, in which each individual objective (e.g., task, residual, mini-batch loss, or training example) is treated as an independent target rather than being combined into a scalar loss before updates are computed. In contrast to standard gradient descent, which applies only to scalar losses, JD uses the structure of the objective's Jacobian matrix to synthesize parameter updates that resolve conflicts between objectives and provide principled control over influence and preference. It subsumes a range of related methods, including natural gradient descent and multi-objective optimization via dual cone projection. Recent algorithmic instantiations, principally Sven (Singular Value Descent) and the UpGrad aggregator, demonstrate competitive empirical performance while admitting efficient SVD- or Gramian-based approximations to handle high-dimensional problems (Bright-Thonney et al., 1 Apr 2026, Quinton et al., 2024).

1. Formalism and Core Update Rules

Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be a vector-valued objective function (e.g., a vector of per-task or per-example losses), with parameters $\theta \in \mathbb{R}^n$. The Jacobian of $f$ at $\theta$ is $J_f(\theta) \in \mathbb{R}^{m \times n}$, with rows $\nabla f_i(\theta)^T$. Jacobian Descent replaces the scalar loss gradient with an aggregated direction constructed from these row-gradients. The general JD update step is

$\theta \gets \theta - \eta\,A(J_f(\theta))$

where $A : \mathbb{R}^{m \times n} \to \mathbb{R}^n$ is an aggregator function that combines the $m$ row-gradients into a single descent direction.
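The update rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `jacobian`, `mean_aggregator`, and `jd_step` are hypothetical, the Jacobian is approximated by forward finite differences rather than automatic differentiation, and the row-mean is used as the simplest possible aggregator (with it, the step coincides with gradient descent on the average of the components).

```python
import numpy as np

def jacobian(f, theta, eps=1e-6):
    """Forward finite-difference Jacobian of f: R^n -> R^m, shape (m, n)."""
    f0 = np.asarray(f(theta), dtype=float)
    J = np.empty((f0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        J[:, j] = (np.asarray(f(theta + step)) - f0) / eps
    return J

def mean_aggregator(J):
    """Averages the rows; with this choice JD is gradient descent on mean(f)."""
    return J.mean(axis=0)

def jd_step(f, theta, aggregator, lr):
    """One update: theta <- theta - lr * A(J_f(theta))."""
    return theta - lr * aggregator(jacobian(f, theta))

# Toy objective whose two components are both minimised at (1, -2).
def f(theta):
    return np.array([(theta[0] - 1.0) ** 2, (theta[1] + 2.0) ** 2])

theta = np.zeros(2)
for _ in range(200):
    theta = jd_step(f, theta, mean_aggregator, lr=0.2)
```

Any function mapping an `(m, n)` array to a length-`n` vector can be passed as `aggregator`, which is the only place where different JD variants differ in this sketch.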

In the classical least-squares case, where $L(\theta) = \frac{1}{2} \|r(\theta)\|^2$ for $r(\theta) \in \mathbb{R}^m$ the stacked vector of residuals with Jacobian $J = J_r(\theta) \in \mathbb{R}^{m \times n}$, the update seeks to solve, to first order,

$r(\theta) + J\,\delta\theta = 0$

for minimal $\|\delta\theta\|$. The minimum-norm solution is given by the Moore–Penrose pseudoinverse: $\delta\theta = -J^{+} r(\theta)$. This generalizes to both over- ($n > m$) and under- ($n < m$) parametrized regimes, unifying standard Gauss–Newton/natural-gradient updates and their overcomplete generalization (Bright-Thonney et al., 1 Apr 2026).
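As a small numerical illustration of the pseudoinverse step, the sketch below applies it to a linear residual $r(\theta) = A\theta - b$, whose Jacobian is simply $A$. The problem sizes, the random data, and the helper name `pinv_step` are assumptions made for the example; the point is only that one step zeroes the residual in the over-parametrized case and that the step agrees with the minimum-norm least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

def pinv_step(J, r):
    """Minimum-norm least-squares solution of r + J @ delta = 0."""
    return -np.linalg.pinv(J) @ r

# Over-parametrized toy problem: m = 3 residuals, n = 5 parameters.
A = rng.standard_normal((3, 5))
b = rng.standard_normal(3)
theta = np.zeros(5)

r = A @ theta - b            # residual vector r(theta)
delta = pinv_step(A, r)      # the Jacobian of a linear residual is A itself
theta_new = theta + delta
residual_after = A @ theta_new - b

# np.linalg.lstsq also returns the minimum-norm solution when the system
# is underdetermined, so it serves as an independent cross-check.
delta_lstsq = np.linalg.lstsq(A, -r, rcond=None)[0]
```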

In multi-objective optimization, JD acts on vector objectives $f: \mathbb{R}^n \to \mathbb{R}^m$, where each $f_i$ is an objective to minimize. Pareto optimality is defined in terms of the component-wise partial order $\leq$ on $\mathbb{R}^m$, and the aim is to approach the Pareto front through a sequence of non-conflicting updates (Quinton et al., 2024).
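To make the partial order concrete, the following sketch checks Pareto dominance between objective vectors under the usual component-wise convention (all objectives minimized). The function names `dominates` and `pareto_front` and the sample values are illustrative choices, not part of the cited work.

```python
import numpy as np

def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_front(points):
    """Indices of the points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Four candidate objective vectors (f_1, f_2); (3, 3) is dominated by (2, 2).
values = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(values)
```

The remaining points trade one objective against the other, so none of them dominates another; this incomparability is what distinguishes a partial order from the total order on scalar losses.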

2. Aggregator Functions and Unconflicted Descent

The choice of aggregator $A$ is central. Writing $J = J_f(\theta)$, recent work introduces the "UpGrad" aggregator, which enforces three properties: (1) non-conflictingness (the update direction never increases any objective locally, i.e., $J\,A(J) \geq 0$ component-wise), (2) linear scaling under row-scaling, and (3) ensuring $A(J)$ lies in the row-span of $J$. UpGrad is constructed by projecting each individual gradient $g_i = J^T e_i$ onto the dual cone of the rows of $J$, $\{x \in \mathbb{R}^n : Jx \geq 0\}$, and averaging: $A_{\mathrm{UpGrad}}(J) = \frac{1}{m} \sum_{i=1}^{m} \pi_J(g_i)$, where $\pi_J$ denotes this projection. In practice, this is efficiently implemented in the Gramian form by solving, for each $i$, a small quadratic program

$w_i \in \arg\min_{v \in \mathbb{R}^m,\; v \geq e_i} v^T J J^T v$

with the descent direction $A_{\mathrm{UpGrad}}(J) = J^T w$, $w = \frac{1}{m} \sum_{i=1}^{m} w_i$ (Quinton et al., 2024).
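A minimal sketch of the Gramian form is given below. It assumes the quadratic program for objective $i$ is: minimize $v^T J J^T v$ subject to $v \geq e_i$ component-wise, and it solves that program with plain projected gradient descent rather than a dedicated QP solver. The names `solve_box_qp` and `upgrad` are hypothetical, and this is not the reference implementation.

```python
import numpy as np

def solve_box_qp(G, lower, iters=5000):
    """Projected gradient descent for  min_v v^T G v  subject to v >= lower."""
    lam_max = np.linalg.eigvalsh(G)[-1]   # eigenvalues come back in ascending order
    v = lower.astype(float).copy()
    if lam_max <= 0.0:                    # all gradients are zero; lower is optimal
        return v
    step = 1.0 / (2.0 * lam_max)          # 1 / Lipschitz constant of the gradient 2 G v
    for _ in range(iters):
        v = np.maximum(lower, v - step * 2.0 * (G @ v))
    return v

def upgrad(J):
    """UpGrad-style aggregation computed from the m x m Gramian G = J J^T."""
    m = J.shape[0]
    G = J @ J.T
    w = np.mean([solve_box_qp(G, np.eye(m)[i]) for i in range(m)], axis=0)
    return J.T @ w

# Two strongly conflicting gradients of very different magnitude.
J = np.array([[1.0, 0.0],
              [-3.0, 1.0]])
direction = upgrad(J)
mean_direction = J.mean(axis=0)
```

In this example the plain row-mean has a negative inner product with the first gradient, so a step along it would increase the first objective to first order, whereas the aggregated `direction` has non-negative inner product with both rows. Note that every optimization variable lives in $\mathbb{R}^m$; the parameter dimension enters only through the Gramian and the final product $J^T w$.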

3. Connections to Natural Gradient Descent and Parameter Regimes

Jacobian Descent generalizes natural gradient and Gauss–Newton methods. Write $J$ for the Jacobian of the residual vector $r$ and $\delta\theta$ for the parameter update. In the under-parametrized regime ($n < m$), $J^T J$ is invertible (for full-column-rank $J$) and the JD update recovers the standard natural-gradient form: $\delta\theta = -(J^T J)^{-1} J^T r$. In the over-parametrized regime ($n > m$), $J^T J$ is singular but $J J^T$ is invertible (for full-row-rank $J$), so $J^{+} = J^T (J J^T)^{-1}$ remains well-defined and provides a well-posed update. Thus, JD extends the applicability of second-order preconditioned descent to highly overcomplete and ill-conditioned settings, including scientific computing scenarios with many conditions per parameter (Bright-Thonney et al., 1 Apr 2026).
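The two regimes can be checked numerically. The sketch below assumes the standard closed forms of the Moore–Penrose pseudoinverse for full-rank matrices, $(J^T J)^{-1} J^T$ for a tall Jacobian and $J^T (J J^T)^{-1}$ for a wide one, and compares each against `np.linalg.pinv` on random matrices; the shapes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Under-parametrized: more residuals than parameters (m > n), full column rank.
J_tall = rng.standard_normal((8, 3))
left_inverse = np.linalg.solve(J_tall.T @ J_tall, J_tall.T)   # (J^T J)^{-1} J^T

# Over-parametrized: more parameters than residuals (n > m), full row rank.
J_wide = rng.standard_normal((3, 8))
right_inverse = J_wide.T @ np.linalg.inv(J_wide @ J_wide.T)   # J^T (J J^T)^{-1}

tall_matches = np.allclose(left_inverse, np.linalg.pinv(J_tall))
wide_matches = np.allclose(right_inverse, np.linalg.pinv(J_wide))

# In the wide case the 8 x 8 matrix J^T J has rank at most 3, so the
# left-inverse formula cannot be used there.
rank_JtJ = np.linalg.matrix_rank(J_wide.T @ J_wide)
```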

4. Algorithmic Realization and Computational Aspects

The computation of the full pseudoinverse or the full Jacobian can be expensive in high dimensions. Efficient approximations are based on truncated SVD: $J \approx U_k \Sigma_k V_k^T$, so that $J^{+} \approx V_k \Sigma_k^{-1} U_k^T$, where only the top-$k$ singular values/modes are retained (with $k$ typically set relative to the batch size $B$). This reduces memory and compute cost to a modest multiple of the cost of a gradient pass.
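The truncation idea can be sketched as follows. For clarity this version computes a full thin SVD with `np.linalg.svd` and then discards modes, which does not deliver the cost savings described above; an efficient implementation would presumably use an iterative or randomized solver for the leading modes. The function name `truncated_pinv_step` and the relative threshold `rel_tol` are assumptions for the example.

```python
import numpy as np

def truncated_pinv_step(J, r, k, rel_tol=1e-8):
    """Approximate -J^+ r using at most the top-k singular triplets of J."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)   # s is sorted descending
    keep = min(k, int(np.sum(s > rel_tol * s[0])))     # also drop tiny singular values
    U_k, s_k, Vt_k = U[:, :keep], s[:keep], Vt[:keep]
    return -Vt_k.T @ ((U_k.T @ r) / s_k)

rng = np.random.default_rng(2)
J = rng.standard_normal((6, 20))
r = rng.standard_normal(6)

full_step = truncated_pinv_step(J, r, k=6)       # all six modes kept
low_rank_step = truncated_pinv_step(J, r, k=2)   # only the two leading modes
```

Dropping modes can only shrink the step, since each retained mode contributes an orthogonal component; the discarded small singular values are exactly the ones that would otherwise produce the largest and least stable contributions.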

For general multi-objective problems with $m$ objectives, Gramian-based implementations allow further reductions: each quadratic program for UpGrad is of dimension $m$ (number of objectives), not $n$ (parameter count), leading to substantial acceleration when $m \ll n$. Additional scalability techniques include stochastic sub-sampling over rows (objectives/examples), micro-batching, and blockwise-Gaussian elimination in structured architectures. Regularization via singular value thresholding is recommended to ameliorate instability in highly overparameterized or classification settings (Bright-Thonney et al., 1 Apr 2026, Quinton et al., 2024).

5. Instance-Wise Risk Minimization (IWRM) and Associated Applications

JD underpins the Instance-Wise Risk Minimization (IWRM) paradigm, where each per-example loss is treated as a separate objective rather than aggregating by mean (ERM). Stochastic Sub-Jacobian Descent (SSJD) generalizes SGD: at each step, one samples a minibatch of per-example losses and aggregates the corresponding sub-Jacobian by UpGrad or another aggregator.
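The loop structure of SSJD can be sketched on a synthetic linear regression problem where per-example gradients are available in closed form. Everything here is an assumption made for illustration: the data, the hyperparameters, and the helper names. The row-mean is used as a stand-in aggregator, in which case the loop is exactly minibatch SGD; the IWRM variant described above would pass a conflict-resolving aggregator such as UpGrad instead.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data; per-example loss_i = 0.5 * (x_i . theta - y_i)^2.
X = rng.standard_normal((256, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

def sub_jacobian(theta, idx):
    """Rows are per-example gradients (x_i . theta - y_i) * x_i for i in idx."""
    residuals = X[idx] @ theta - y[idx]
    return residuals[:, None] * X[idx]

def mean_aggregator(J):
    """Stand-in aggregator; with it the loop below is plain minibatch SGD."""
    return J.mean(axis=0)

def mean_loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def ssjd(theta, aggregator, steps=500, batch_size=8, lr=0.1):
    """Sample rows of the Jacobian, aggregate them, and take a step."""
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        theta = theta - lr * aggregator(sub_jacobian(theta, idx))
    return theta

theta0 = np.zeros(3)
theta_hat = ssjd(theta0, mean_aggregator)
```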

Empirical results on small-scale image classification (SVHN, CIFAR-10, MNIST, Fashion-MNIST, EuroSAT, Kuzushiji-MNIST) show that UpGrad+IWRM accelerates convergence relative to mean aggregation: optimization prioritizes hard examples, resolves substantial early-stage gradient conflict among datapoints, and avoids stagnation seen in magnitude-blind multi-objective aggregators. Additional applications include multi-task learning, adversarial domain adaptation, GANs, various optimizer classes (Nesterov, AdamW), distributed/federated learning, and custom loss compositions in scientific computation (Quinton et al., 2024, Bright-Thonney et al., 1 Apr 2026).

6. Theoretical Properties and Convergence Guarantees

For β-smooth and partially convex $f$, and a suitably chosen step size $\eta$, JD+UpGrad guarantees convergence of $f(\theta_t)$ to the Pareto front. The proof relies on a generalized descent lemma and the fact that UpGrad’s weights remain bounded. For any non-negative test weight vector $w \in \mathbb{R}^m$, the linear combination $w^T f(\theta_t)$ is provably monotone decreasing, and boundedness plus descent yields convergence. The UpGrad aggregator is the first to meet all three essential properties (non-conflictingness, proportional scaling, and row-span weighting) while retaining algorithmic tractability (Quinton et al., 2024).

7. Practical Considerations and Tooling

The principal computational challenge in JD is memory overhead from per-sample Jacobians, $O(Bn)$ for batch size $B$ and $n$ parameters. Mitigation strategies include micro-batching and parameter batching. The choice of $k$ in truncated SVD strongly interacts with the effective Jacobian rank, with rapid singular value decay leading to early performance saturation. The threshold for singular value truncation also acts as a regularizer.
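A back-of-the-envelope estimate shows why this overhead matters. The sketch assumes the per-sample Jacobian is stored densely with one row per example and one column per parameter, in 32-bit floats, and that micro-batching holds only one block of rows in memory at a time; the model size and batch sizes are made-up numbers and the function names are hypothetical.

```python
def jacobian_bytes(batch_size, num_params, bytes_per_value=4):
    """Memory for a dense (batch_size x num_params) Jacobian, float32 by default."""
    return batch_size * num_params * bytes_per_value

def peak_bytes_with_micro_batches(batch_size, num_params, micro_batch_size,
                                  bytes_per_value=4):
    """Peak Jacobian memory if only one micro-batch of rows is held at a time."""
    rows = min(batch_size, micro_batch_size)
    return jacobian_bytes(rows, num_params, bytes_per_value)

# 25 million parameters, batch of 64 examples.
full = jacobian_bytes(64, 25_000_000)                        # 6.4e9 bytes
chunked = peak_bytes_with_micro_batches(64, 25_000_000, 8)   # 8.0e8 bytes
```

Whether the rows can really be processed in independent blocks depends on the aggregator, so the second figure should be read as a lower bound on what micro-batching can achieve.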

A PyTorch library (“torchjd”) implements JD, SEJD, SSJD, UpGrad, and related schemes. It supports integration with arbitrary neural architectures: the user wraps the model, specifies an aggregator, and selects the batch and step sizes.

In summary, Jacobian Descent both unifies and extends the optimization of vector-valued objectives by leveraging the Jacobian structure, allowing principled resolution of conflicts among multiple residuals or tasks, and enabling accelerated convergence and richer risk control compared to scalar-aggregated loss minimization (Bright-Thonney et al., 1 Apr 2026, Quinton et al., 2024).


References

Bright-Thonney et al. (2026). Sven: Singular Value Descent as a Computationally Efficient Natural Gradient Method.

Quinton et al. (2024). Jacobian Descent for Multi-Objective Optimization.
