Jacobian Descent: Optimizing Vector Objectives

Updated 10 April 2026
  • Jacobian Descent (JD) is a framework for optimizing vector-valued objectives by treating each component as an independent target and leveraging Jacobian structures.
  • JD employs aggregator functions like UpGrad to create non-conflicting descent directions, ensuring principled control and efficient scaling via SVD or Gramian approximations.
  • JD unifies natural gradient and multi-objective optimization methods, enhancing convergence in both over- and under-parameterized regimes.

Jacobian Descent (JD) is a general framework for optimizing vector-valued objectives, in which each individual objective (e.g., task, residual, mini-batch loss, or training example) is treated as an independent target rather than combining all objectives into a scalar loss before computing updates. In contrast to standard gradient descent, which can only be applied to scalar losses, JD leverages the structure of the objective's Jacobian matrix to synthesize parameter updates that resolve conflicts between objectives, provide principled control over influence and preference, and subsume a range of related methods including natural gradient descent and multi-objective optimization via dual cone projection. Recent algorithmic instantiations—principally Sven (Singular Value Descent) and the UpGrad aggregator—demonstrate competitive empirical performance while admitting efficient SVD- or Gramian-based approximations to handle high-dimensional problems (Bright-Thonney et al., 1 Apr 2026, Quinton et al., 2024).

1. Formalism and Core Update Rules

Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be a vector-valued objective function (e.g., a vector of per-task or per-example losses), with parameters $\theta \in \mathbb{R}^n$. The Jacobian of $f$ at $\theta$ is $J_f(\theta) \in \mathbb{R}^{m \times n}$, with rows $\nabla f_i(\theta)^T$. Jacobian Descent replaces the scalar loss gradient with an aggregated direction constructed from these row-gradients. The general JD update step is

$$\theta \gets \theta - \eta\,A(J_f(\theta))$$

where $A : \mathbb{R}^{m \times n} \to \mathbb{R}^n$ is an aggregator function that combines the $m$ row-gradients into a single descent direction.
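The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any library's implementation: the names `jd_step`, `mean_aggregator`, `toy_jacobian`, and `centers` are hypothetical, and the toy objectives $f_i(\theta) = \frac{1}{2}\|\theta - c_i\|^2$ are an assumption chosen so that the Jacobian is available in closed form. With the mean aggregator, JD reduces to ordinary gradient descent on the average objective.

```python
import numpy as np

def mean_aggregator(jacobian: np.ndarray) -> np.ndarray:
    """Average the m row-gradients (JD then equals gradient descent on the mean objective)."""
    return jacobian.mean(axis=0)

def jd_step(theta, jacobian_fn, aggregator, lr):
    """One Jacobian Descent step: theta <- theta - lr * A(J_f(theta))."""
    J = jacobian_fn(theta)              # shape (m, n): one gradient per objective
    return theta - lr * aggregator(J)

# Hypothetical toy objectives f_i(theta) = 0.5 * ||theta - c_i||^2 with gradients theta - c_i.
centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])

def toy_jacobian(theta):
    return theta - centers              # row i is grad f_i(theta)

theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = jd_step(theta, toy_jacobian, mean_aggregator, lr=0.1)
```

Any other function mapping an `(m, n)` array to a length-`n` vector can be passed as `aggregator`; only that argument changes between JD variants in this sketch.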

In the classical least-squares case, where $L(\theta) = \frac{1}{2} \|r(\theta)\|^2$ for $r(\theta) \in \mathbb{R}^m$ the stacked vector of residuals with Jacobian $J = J_r(\theta) \in \mathbb{R}^{m \times n}$, the update seeks to solve, to first order,

$$r(\theta) + J\,\delta\theta = 0$$

for minimal $\|\delta\theta\|$. The minimum-norm solution is given by the Moore–Penrose pseudoinverse: $\delta\theta = -J^{+} r(\theta)$. This generalizes to both over- ($n > m$) and under- ($n < m$) parametrized regimes, unifying standard Gauss–Newton/natural-gradient updates and their overcomplete generalization (Bright-Thonney et al., 1 Apr 2026).
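As a rough numerical check of the pseudoinverse step, the sketch below applies it to linear residuals $r(\theta) = A\theta - b$, an assumption made so that the Jacobian is simply $A$. The helper `pinv_step` and the random test matrices are hypothetical. With more parameters than residuals the step cancels every residual; with fewer parameters it lands on the ordinary least-squares fit.

```python
import numpy as np

def pinv_step(jacobian, residual):
    """Minimum-norm least-squares solution of residual + jacobian @ delta = 0."""
    return -np.linalg.pinv(jacobian) @ residual

rng = np.random.default_rng(0)

# Over-parametrized: n = 5 parameters, m = 3 residuals.
A_wide = rng.normal(size=(3, 5))
b_wide = rng.normal(size=3)
theta = np.zeros(5)
delta_wide = pinv_step(A_wide, A_wide @ theta - b_wide)
residual_after_wide = A_wide @ (theta + delta_wide) - b_wide

# Under-parametrized: n = 2 parameters, m = 6 residuals.
A_tall = rng.normal(size=(6, 2))
b_tall = rng.normal(size=6)
theta = np.zeros(2)
delta_tall = pinv_step(A_tall, A_tall @ theta - b_tall)
lstsq_solution = np.linalg.lstsq(A_tall, b_tall, rcond=None)[0]  # reference fit
```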

In multi-objective optimization, JD acts on vector objectives $f = (f_1, \dots, f_m)$, where each $f_i$ is an objective to minimize. Pareto optimality is defined in terms of the component-wise partial order $\leq$ on $\mathbb{R}^m$, and the aim is to approach the Pareto front through a sequence of non-conflicting updates (Quinton et al., 2024).
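To make the partial order concrete, here is a small sketch of the usual dominance test for minimization: one objective vector dominates another if it is no worse in every component and strictly better in at least one. The function names and the candidate points are illustrative assumptions.

```python
import numpy as np

def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_front(points):
    """Return the points that no other point dominates."""
    points = np.asarray(points)
    return [p.tolist() for p in points if not any(dominates(q, p) for q in points)]

candidates = [[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]]
front = pareto_front(candidates)   # [3.0, 3.0] is dominated by [2.0, 2.0]
```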

2. Aggregator Functions and Unconflicted Descent

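The choice of aggregator $A$ is central. Recent work introduces the "UpGrad" aggregator, which enforces three properties for $J = J_f(\theta)$: (1) non-conflictingness (the update direction never increases any objective locally, i.e., $J \cdot A(J) \geq 0$ component-wise), (2) linear scaling under row-scaling, and (3) ensuring $A(J)$ lies in the row-span of $J$. UpGrad is constructed by projecting each individual gradient $\nabla f_i(\theta)$ onto the dual cone $\{x \in \mathbb{R}^n : Jx \geq 0\}$ and averaging: $A_{\mathrm{UpGrad}}(J) = \frac{1}{m} \sum_{i=1}^{m} \pi_J(\nabla f_i(\theta))$, where $\pi_J$ denotes the projection onto that cone. In practice, this is efficiently implemented in the Gramian form by solving, for each $i$, a small quadratic program

$$w_i \in \arg\min_{v \in \mathbb{R}^m,\; v \geq e_i} v^T J J^T v$$

with the descent direction $A_{\mathrm{UpGrad}}(J) = J^T w$, $w = \frac{1}{m} \sum_{i=1}^{m} w_i$, where $e_i$ is the $i$-th standard basis vector of $\mathbb{R}^m$ (Quinton et al., 2024).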
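The Gramian form can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: each projection is computed by minimizing $v^T G v$ subject to $v \geq e_i$ with $G = J J^T$, and the hypothetical helper `solve_box_qp` uses plain projected gradient descent as a stand-in for a proper QP solver. The two-row Jacobian is a made-up example in which simple averaging conflicts with the first objective.

```python
import numpy as np

def solve_box_qp(gramian, lower, iters=5000):
    """Minimize v^T G v subject to v >= lower, by projected gradient descent."""
    # The gradient 2 G v is Lipschitz with constant 2 * lambda_max(G).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(gramian)[-1] + 1e-12)
    v = lower.astype(float).copy()
    for _ in range(iters):
        v = np.maximum(lower, v - step * 2.0 * gramian @ v)
    return v

def upgrad(jacobian):
    """UpGrad-style aggregation: average the dual-cone projections of the rows."""
    m = jacobian.shape[0]
    gramian = jacobian @ jacobian.T      # (m, m): size independent of the parameter count
    weights = np.mean([solve_box_qp(gramian, np.eye(m)[i]) for i in range(m)], axis=0)
    return jacobian.T @ weights          # lies in the row-span of the Jacobian

# Two conflicting gradients: their inner product is negative.
J = np.array([[1.0, 0.0], [-2.0, 1.0]])
mean_direction = J.mean(axis=0)          # J @ mean_direction has a negative first entry
upgrad_direction = upgrad(J)             # J @ upgrad_direction is non-negative
```

In this example the averaged direction would increase the first objective to first order, while the aggregated direction does not increase either one.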

3. Connections to Natural Gradient Descent and Parameter Regimes

Jacobian Descent generalizes natural gradient and Gauss–Newton methods. Write $J \in \mathbb{R}^{m \times n}$ for the Jacobian of the residuals $r$ and assume $J$ has full rank. In the under-parametrized regime ($n < m$), $J^T J$ is invertible and the JD update recovers the standard natural-gradient form: $\delta\theta = -(J^T J)^{-1} J^T r$. In the over-parametrized regime ($n > m$), $J^T J$ is singular but $J J^T$ is invertible, so $J^{+} = J^T (J J^T)^{-1}$ remains well-defined and provides a well-posed update. Thus, JD extends the applicability of second-order preconditioned descent to highly overcomplete and ill-conditioned settings, including scientific computing scenarios with many conditions per parameter (Bright-Thonney et al., 1 Apr 2026).
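The two closed forms of the pseudoinverse can be checked numerically. The sketch below uses random full-rank Jacobians, which is an assumption made for illustration; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Under-parametrized: m = 8 residuals, n = 3 parameters, so J^T J (3 x 3) is invertible.
J_tall = rng.normal(size=(8, 3))
left_inverse = np.linalg.inv(J_tall.T @ J_tall) @ J_tall.T

# Over-parametrized: m = 3 residuals, n = 8 parameters, so J J^T (3 x 3) is invertible.
J_wide = rng.normal(size=(3, 8))
right_inverse = J_wide.T @ np.linalg.inv(J_wide @ J_wide.T)

tall_matches = np.allclose(left_inverse, np.linalg.pinv(J_tall))
wide_matches = np.allclose(right_inverse, np.linalg.pinv(J_wide))
# J^T J is 8 x 8 but only has rank 3 in the wide case, so it cannot be inverted.
gram_rank_wide = np.linalg.matrix_rank(J_wide.T @ J_wide)
```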

4. Algorithmic Realization and Computational Aspects

The computation of the full pseudoinverse or the full Jacobian can be expensive in high dimensions. Efficient approximations are based on truncated SVD, $J \approx U_k \Sigma_k V_k^T$ and hence $J^{+} \approx V_k \Sigma_k^{-1} U_k^T$, where only the top-$k$ singular values/modes are retained (with $k$ typically chosen relative to the batch size). This keeps the memory and compute cost at a modest multiple of the cost of a gradient pass.
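A truncated-SVD step might look like the sketch below. The function name `truncated_pinv_step`, the rank-2 test Jacobian, and the use of a dense `np.linalg.svd` are assumptions for illustration; a large-scale implementation would presumably avoid forming the full decomposition.

```python
import numpy as np

def truncated_pinv_step(jacobian, residual, k):
    """Approximate -pinv(J) @ r using only the k largest singular modes of J."""
    U, s, Vt = np.linalg.svd(jacobian, full_matrices=False)  # s is in descending order
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    return -Vt_k.T @ ((U_k.T @ residual) / s_k)

rng = np.random.default_rng(2)
# A 6 x 40 Jacobian of exact rank 2, e.g. six gradients spanning only two directions.
J = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 40))
r = rng.normal(size=6)

step_k2 = truncated_pinv_step(J, r, k=2)
# Reference: pseudoinverse with an explicit cutoff so numerically zero modes are ignored.
step_full = -np.linalg.pinv(J, rcond=1e-10) @ r
```

When the Jacobian has rank at most `k`, the truncated step coincides with the full pseudoinverse step, which is the situation the check above constructs.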

For general multi-objective problems with $m$ objectives, Gramian-based implementations allow further reductions: each quadratic program for UpGrad is of dimension $m$ (number of objectives), not $n$ (parameter count), leading to substantial acceleration when $m \ll n$. Additional scalability techniques include stochastic sub-sampling over rows (objectives/examples), micro-batching, and blockwise-Gaussian elimination in structured architectures. Regularization via singular value thresholding is recommended to ameliorate instability in highly overparameterized or classification settings (Bright-Thonney et al., 1 Apr 2026, Quinton et al., 2024).
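The effect of singular value thresholding can be illustrated on a nearly degenerate Jacobian. In this sketch the relative cutoff `rel_tol` and the helper `thresholded_pinv_step` are hypothetical choices, not values or names taken from the cited work.

```python
import numpy as np

def thresholded_pinv_step(jacobian, residual, rel_tol=1e-3):
    """Pseudoinverse step that drops singular values below rel_tol * the largest one."""
    U, s, Vt = np.linalg.svd(jacobian, full_matrices=False)
    keep = s > rel_tol * s[0]
    return -Vt[keep].T @ ((U[:, keep].T @ residual) / s[keep])

# Two nearly parallel gradients give one tiny singular value.
J = np.array([[1.0, 0.0, 0.0],
              [1.0, 1e-6, 0.0]])
r = np.array([1.0, -1.0])

raw_step = -np.linalg.pinv(J) @ r                # inverts the tiny mode: very large step
regularized_step = thresholded_pinv_step(J, r)   # ignores the tiny mode: bounded step
```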

5. Instance-Wise Risk Minimization (IWRM) and Associated Applications

JD underpins the Instance-Wise Risk Minimization (IWRM) paradigm, where each per-example loss is treated as a separate objective rather than aggregating by mean (ERM). Stochastic Sub-Jacobian Descent (SSJD) generalizes SGD: at each step, one samples a minibatch of per-example losses and aggregates the corresponding sub-Jacobian by UpGrad or another aggregator.
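A stochastic sub-Jacobian loop can be sketched on a toy linear regression problem, where each per-example loss is $\frac{1}{2}(x_i^T\theta - y_i)^2$. Everything here is an illustrative assumption: the function names, the synthetic data, and the hyperparameters. The mean aggregator is used so that the sketch reduces to ordinary minibatch SGD; under IWRM one would pass a different aggregator in its place.

```python
import numpy as np

def per_example_jacobian(theta, X, y):
    """Row i is the gradient of the per-example loss 0.5 * (x_i @ theta - y_i)^2."""
    return (X @ theta - y)[:, None] * X

def ssjd(theta, X, y, aggregator, lr=0.1, batch_size=8, steps=300, seed=0):
    """Stochastic sub-Jacobian descent: aggregate a random subset of rows at each step."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        sub_jacobian = per_example_jacobian(theta, X[idx], y[idx])  # (batch_size, n)
        theta = theta - lr * aggregator(sub_jacobian)
    return theta

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                       # noise-free synthetic targets

def mean_loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

theta0 = np.zeros(3)
# With the mean aggregator this is the ERM baseline (plain minibatch SGD).
theta_hat = ssjd(theta0, X, y, aggregator=lambda J: J.mean(axis=0))
```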

Empirical results on small-scale image classification (SVHN, CIFAR-10, MNIST, Fashion-MNIST, EuroSAT, Kuzushiji-MNIST) show that UpGrad+IWRM accelerates convergence relative to mean aggregation: optimization prioritizes hard examples, resolves substantial early-stage gradient conflict among datapoints, and avoids stagnation seen in magnitude-blind multi-objective aggregators. Additional applications include multi-task learning, adversarial domain adaptation, GANs, various optimizer classes (Nesterov, AdamW), distributed/federated learning, and custom loss compositions in scientific computation (Quinton et al., 2024, Bright-Thonney et al., 1 Apr 2026).

6. Theoretical Properties and Convergence Guarantees

For β-smooth and partially convex $f$, and a suitably chosen step size $\eta$, JD+UpGrad guarantees convergence of $f(\theta_t)$ to the Pareto front. The proof relies on a generalized descent lemma and the fact that UpGrad’s weights remain bounded. For any non-negative test weight vector $w$, the linearization $w^T f(\theta_t)$ is provably monotone decreasing, and boundedness plus descent yields convergence. The UpGrad aggregator is the first to meet all three essential properties (non-conflictingness, proportional scaling, and row-span weighting) while retaining algorithmic tractability (Quinton et al., 2024).

7. Practical Considerations and Tooling

The principal computational challenge in JD is memory overhead from per-sample Jacobians, $O(Bn)$ for batch size $B$ and $n$ parameters. Mitigation strategies include micro-batching and parameter batching. The choice of $k$ in truncated SVD strongly interacts with the effective Jacobian rank, with rapid singular value decay leading to early performance saturation. The threshold for singular value truncation also acts as a regularizer.
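One way to limit peak memory, sketched below, is to accumulate the Gramian $J J^T$ from column blocks of the Jacobian so that only one block is held at a time. Reading "parameter batching" as this kind of block-wise processing is an assumption, as are the names `gramian_by_parameter_blocks` and `jacobian_block_fn`; the full Jacobian is materialized here only to verify the result.

```python
import numpy as np

def gramian_by_parameter_blocks(jacobian_block_fn, num_blocks, m):
    """Accumulate G = J @ J.T from column blocks of J, one block in memory at a time."""
    gramian = np.zeros((m, m))
    for b in range(num_blocks):
        J_b = jacobian_block_fn(b)       # (m, n_b): gradients w.r.t. one parameter block
        gramian += J_b @ J_b.T
    return gramian

rng = np.random.default_rng(4)
B, n = 4, 12
full_jacobian = rng.normal(size=(B, n))             # kept only to check the result
blocks = np.array_split(full_jacobian, 3, axis=1)   # stand-in for per-layer Jacobians

G = gramian_by_parameter_blocks(lambda b: blocks[b], num_blocks=3, m=B)
```

The accumulated matrix is only $B \times B$, so Gramian-based aggregation can proceed without ever storing all $B \times n$ entries at once.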

A PyTorch library (“torchjd”) implements JD, SEJD, SSJD, UpGrad, and related schemes. It supports integration with arbitrary neural architectures: the user wraps the model, specifies an aggregator, and selects the batch and step sizes.

In summary, Jacobian Descent both unifies and extends the optimization of vector-valued objectives by leveraging the Jacobian structure, allowing principled resolution of conflicts among multiple residuals or tasks, and enabling accelerated convergence and richer risk control compared to scalar-aggregated loss minimization (Bright-Thonney et al., 1 Apr 2026, Quinton et al., 2024).
