
CHESSFAD: Chunked Hessian AD

Updated 1 May 2026
  • CHESSFAD is an automatic differentiation methodology that computes Hessians and Hessian–vector products using a chunked forward-mode approach.
  • It employs a dual-number design with operator overloading to enable fine-grained parallelism on both CPUs and GPUs while minimizing memory usage.
  • Benchmarks on functions like Rosenbrock and Ackley show significant speedups over traditional AD libraries, especially on NVIDIA GPUs.

CHESSFAD (Chunked HESSian using Forward-mode AD) is an automatic differentiation (AD) methodology and software library for efficient, parallel computation of Hessians and Hessian–vector products, specifically targeting applications where large numbers of such products must be computed across many data points simultaneously on modern accelerators such as NVIDIA GPUs. The core design principles of CHESSFAD leverage forward-mode AD with a chunked approach to enable fine-grained parallelism at multiple levels and minimize memory usage by avoiding explicit materialization of the full Hessian. As a lightweight, header-based C++ implementation, CHESSFAD is portable across both CPUs and GPUs (Ranjan et al., 2024).

1. Mathematical Foundations and Dual-Number Design

Let $f:\mathbb{R}^n\to\mathbb{R}$, with Hessian $H(x) = [\partial_i \partial_j f(x)]_{i,j = 1,\dots,n}$. The principal task is to evaluate $H$ or compute products $H v$ for arbitrary $v \in \mathbb{R}^n$.

CHESSFAD employs forward-mode AD extended to second order by using a custom data structure, hDual, which encapsulates the function value, the relevant first-order derivatives, and a chunk of second-order mixed partials:

$$\mathrm{hDual}(u) = \langle u,\, \partial_i u,\, \partial_{j_1}u,\,\ldots,\,\partial_{j_c}u,\,\partial^2_{i j_1}u,\,\ldots,\,\partial^2_{i j_c}u \rangle$$

with $\{j_1,\ldots,j_c\}$ denoting a contiguous chunk of columns in row $i$, and $c$ the chunk size. Through operator overloading, arithmetic on hDuals propagates all required higher-order derivatives. For example, for a product $w = u\,z$ with $\mathrm{hDual}(u)$ and $\mathrm{hDual}(z)$ as above, each second-order slot follows the product rule:

$$\partial^2_{i j_k} w = u\,\partial^2_{i j_k} z + \partial_i u\,\partial_{j_k} z + \partial_{j_k} u\,\partial_i z + z\,\partial^2_{i j_k} u$$

By initializing the appropriate slots to Kronecker deltas, a single invocation of the dual-number-templated $f$ yields multiple entries of one row of $H$.
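The propagation rule can be made concrete with a small sketch. The type name `HDual`, its storage layout (value, then $\partial_i$, then the $c$ first derivatives, then the $c$ mixed second derivatives) and both operators are assumptions chosen for illustration; they are not taken from the CHESSFAD headers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an hDual-like type (illustrative, not the library's).
// Assumed layout: v[0]          = value
//                 v[1]          = d/dx_i
//                 v[2..C+1]     = d/dx_{j_k},          k = 0..C-1
//                 v[C+2..2C+1]  = d^2/(dx_i dx_{j_k}), k = 0..C-1
template <int C>
struct HDual {
    std::array<double, 2 * C + 2> v{};
};

// Addition is linear, so every slot is added component-wise.
template <int C>
HDual<C> operator+(const HDual<C>& a, const HDual<C>& b) {
    HDual<C> r;
    for (std::size_t k = 0; k < r.v.size(); ++k) r.v[k] = a.v[k] + b.v[k];
    return r;
}

// Multiplication applies the first- and second-order product rules slot by slot.
template <int C>
HDual<C> operator*(const HDual<C>& a, const HDual<C>& b) {
    HDual<C> r;
    r.v[0] = a.v[0] * b.v[0];
    r.v[1] = a.v[1] * b.v[0] + a.v[0] * b.v[1];
    for (int k = 0; k < C; ++k) {
        const int d = 2 + k;      // first-derivative slot for column j_k
        const int s = C + 2 + k;  // mixed second-derivative slot for (i, j_k)
        r.v[d] = a.v[d] * b.v[0] + a.v[0] * b.v[d];
        r.v[s] = a.v[s] * b.v[0] + a.v[1] * b.v[d]
               + a.v[d] * b.v[1] + a.v[0] * b.v[s];
    }
    return r;
}
```

As a usage example, for $f(x_0, x_1) = x_0^2 x_1$ at $(3, 5)$ with row $i = 0$ and a chunk covering columns 0 and 1, seeding `x0` as `{3,1,1,0,0,0}` and `x1` as `{5,0,0,1,0,0}` and evaluating `x0 * x0 * x1` leaves $H_{00} = 2x_1$ and $H_{01} = 2x_0$ in the last two slots.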

Hessian–vector products $Hv$ are formed incrementally: for each computed chunk $H_{i,j_1},\ldots,H_{i,j_c}$, the partial dot product $\sum_{k=1}^{c} H_{i,j_k} v_{j_k}$ is accumulated into $(Hv)_i$, and the chunk is then discarded so that the full Hessian is never materialized.
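The streaming accumulation can be sketched on the CPU as follows. The function `hessVecStreamed` and the callback `rowChunk`, which stands in for one chunked AD pass and returns $c$ consecutive entries of row $i$, are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Accumulates H*v one chunk at a time without ever storing H.
// rowChunk(i, j0, c) is an assumed callback returning H[i][j0 .. j0+c-1];
// in a chunked AD scheme it would correspond to one dual-number evaluation of f.
std::vector<double> hessVecStreamed(
    int n, int c, const std::vector<double>& v,
    const std::function<std::vector<double>(int, int, int)>& rowChunk) {
    assert(c > 0 && n % c == 0 && static_cast<int>(v.size()) == n);
    std::vector<double> out(n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j0 = 0; j0 < n; j0 += c) {
            // The chunk lives only for this iteration and is then dropped.
            const std::vector<double> chunk = rowChunk(i, j0, c);
            for (int l = 0; l < c; ++l) out[i] += chunk[l] * v[j0 + l];
        }
    }
    return out;
}
```

Peak storage for second derivatives here is $c$ values per in-flight chunk rather than $n^2$ for the whole matrix, which is the point of the design.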

2. Parallelization Strategies: Row and Chunk Concurrency

CHESSFAD exposes two primary, independent levels of parallelism:

  • Row-parallelism (over the $n$ rows): Each row $i$ of $H$ can be computed independently, since rows $H_{i,:}$ and $H_{k,:}$ are mutually independent for $i \neq k$.
  • Chunk-parallelism (over the $n/c$ chunks of a row): Each row $i$ is divided into $n/c$ contiguous chunks of size $c$, enabling concurrent computation of these chunks.
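Taken together, the two levels give $n \cdot (n/c)$ independent work items per instance. A minimal sketch of how a flat worker index might be decoded into a (row, chunk) pair is shown below; `WorkItem`, `decodeWorkItem` and `workItemsPerInstance` are hypothetical helpers, and the sketch assumes $c$ divides $n$.

```cpp
#include <cassert>

// One unit of independent work: a chunk of one Hessian row.
struct WorkItem {
    int row;    // Hessian row i
    int chunk;  // chunk index j within that row
};

// Number of independent (row, chunk) pairs for one instance.
inline int workItemsPerInstance(int n, int c) {
    assert(c > 0 && n % c == 0);
    return n * (n / c);
}

// Decodes a flat index in [0, n * n/c) into its (row, chunk) pair.
inline WorkItem decodeWorkItem(int tid, int n, int c) {
    const int nchunk = n / c;
    assert(tid >= 0 && tid < workItemsPerInstance(n, c));
    return {tid / nchunk, tid % nchunk};
}
```

For $n = 8$ and $c = 2$ this yields 32 work items, and index 13 maps to row 3, chunk 1.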

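The GPU implementation assigns each (instance, row, chunk) tuple to a GPU thread, as in the following representative CUDA code snippet:

```c++
// f, hDual, CHUNK_INIT, MAX_N and MAX_NCHUNKS are assumed to be defined elsewhere.
// Launch with one block per instance and n * (n / csize) threads per block.
// csize is a template parameter because hDual<csize> needs a compile-time constant.
template<int csize>
__global__ void chessVecKernel(const double* x, const double* v, double* out, int n) {
    __shared__ double sprod[MAX_N][MAX_NCHUNKS];  // per-block partial dot products
    int eid = blockIdx.x;    // instance (data point)
    int tid = threadIdx.x;
    int nchunk = n / csize;
    int i = tid / nchunk;    // Hessian row
    int j = tid % nchunk;    // chunk index within the row

    CHUNK_INIT(y, &x[eid*n], i, j, n, csize);
    auto temp = f<hDual<csize>>(y);

    // Second-order slots start at index csize+2 of the hDual array.
    double partial = 0;
    for (int l = 0; l < csize; ++l)
        partial += temp.v[csize+2+l] * v[eid*n + j*csize + l];

    sprod[i][j] = partial;
    __syncthreads();
    if (j == 0) {            // one thread per row sums that row's chunks
        double sum = 0;
        for (int k = 0; k < nchunk; ++k) sum += sprod[i][k];
        out[eid*n+i] = sum;
    }
}
```

3. Implementation Details

The hDual type is a fixed-size array template:

```c++
template<int csize>
struct hDual {
    double v[2*csize + 2];
    // operator overloads
};
```

Initialization macros (CHUNK_INIT) configure each hDual so that the required derivative slots (for a given row $i$, chunk $j$) are set to $1$.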
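What such a seeding step might look like as an ordinary function is sketched below. `chunkInit` and the `HDual` layout (value, $\partial_i$, $c$ first derivatives, $c$ mixed second derivatives) are assumptions for illustration; the actual CHUNK_INIT macro may differ in signature and detail.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Minimal hDual-like storage, assumed layout:
// v[0] value, v[1] d/dx_i, v[2..C+1] d/dx_{j_k}, v[C+2..2C+1] mixed second partials.
template <int C>
struct HDual {
    std::array<double, 2 * C + 2> v{};
};

// Hypothetical function equivalent of a CHUNK_INIT-style macro: builds the
// dual inputs for Hessian row i and chunk j (columns j*C .. j*C + C - 1).
template <int C>
std::vector<HDual<C>> chunkInit(const std::vector<double>& x, int i, int j) {
    const int n = static_cast<int>(x.size());
    assert(n % C == 0 && i >= 0 && i < n && j >= 0 && j < n / C);
    std::vector<HDual<C>> y(n);
    for (int k = 0; k < n; ++k) y[k].v[0] = x[k];  // function-value slots
    y[i].v[1] = 1.0;                               // seed d x_i / d x_i
    for (int l = 0; l < C; ++l)
        y[j * C + l].v[2 + l] = 1.0;               // seed d x_{j_l} / d x_{j_l}
    // Second-order slots stay zero because each input is linear in itself.
    return y;
}
```

All remaining slots are zero, which is the Kronecker-delta pattern: input $x_k$ has derivative 1 with respect to itself and 0 with respect to every other variable.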

On CUDA, thread-local or register-resident hDual arrays store input variables for each thread. Intermediate results are accumulated into a shared memory array per block for efficient in-block reductions, and dot products are streamed to output buffers to minimize global memory traffic and avoid constructing the full Hessian tensor.

4. Empirical Performance and Benchmarking

Performance was evaluated against the CPU automatic differentiation library autodiff using three standard benchmark functions generalized to $n$ variables: Rosenbrock, Ackley, and Fletcher–Powell. Each experiment consisted of […] independent Hessian–vector products on CPU and […] million on GPU. CHESSFAD delivered speedups over autodiff as follows:

| Function | CPU Speedup | A100 GPU Speedup ([…]) | A100 GPU Speedup ([…]) |
|---|---|---|---|
| Rosenbrock | ~20% | […] | […] |
| Ackley | ~5% | […] | […] |
| Fletcher–Powell | ~49% | […] | […] |

Kernel-only speedup drops sublinearly with respect to $n$ due to quadratic scaling, yet remains significant up to $n = $ […]. The time to perform […] sequential computations is roughly equivalent to […] in parallel on the A100 GPU. This suggests strong GPU scaling headroom before parallel efficiency saturates (Ranjan et al., 2024).

5. Computational Complexity and Chunk Size Trade-Offs

Let […] denote the number of multiplications and […] the number of additions in the scalar code for $f$, with $n$ variables and chunk size $c$.

  • Without symmetry (CHUNK-HESS):
    • Number of function calls: $n^2/c$
    • Each hDual–multiply: […] multiplies, […] adds
    • Each hDual–add: […] adds
    • Total scalar multiplies: […]
    • Total scalar adds: […]
  • With symmetry (SCHUNK-HESS):
    • Only […] chunks are computed
    • Scalar-multiply count: […]
    • The optimal chunk size for fixed […] is […]
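For orientation, the non-symmetric counts can be derived by hand under a straightforward product-rule implementation. The symbols $M$ and $A$ (multiplications and additions in the scalar code for $f$) are introduced here only for this derivation, and the figures need not match those reported by the authors if their operators are implemented differently.

An hDual holds $2c+2$ components, so one hDual addition costs $2c+2$ scalar adds. For an hDual product $w = u\,z$:

$$
\begin{aligned}
w &= u z && \text{(1 multiply)}\\
\partial_i w &= u\,\partial_i z + z\,\partial_i u && \text{(2 multiplies, 1 add)}\\
\partial_{j_k} w &= u\,\partial_{j_k} z + z\,\partial_{j_k} u && \text{(2 multiplies, 1 add, for each of } c \text{ slots)}\\
\partial^2_{i j_k} w &= u\,\partial^2_{i j_k} z + \partial_i u\,\partial_{j_k} z + \partial_{j_k} u\,\partial_i z + z\,\partial^2_{i j_k} u && \text{(4 multiplies, 3 adds, for each of } c \text{ slots)}
\end{aligned}
$$

which gives $6c+3$ scalar multiplies and $4c+1$ scalar adds per hDual product. With $n$ rows and $n/c$ chunks per row there are $n^2/c$ evaluations of $f$, so under these assumptions

$$
\text{multiplies} = \frac{n^2}{c}\,M\,(6c+3), \qquad
\text{adds} = \frac{n^2}{c}\,\bigl[M\,(4c+1) + A\,(2c+2)\bigr].
$$

The fixed overhead per pass (the $+3$, $+1$ and $+2$ terms) is amortized over more columns as $c$ grows, which is one reason larger chunks reduce total arithmetic.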

By streaming out each chunk's result directly into the dot product, memory bandwidth is reduced and the need to store the full Hessian is removed. The chunk size $c$ introduces a trade-off: larger $c$ reduces the number of AD passes at the expense of increased per-pass memory, while smaller $c$ lightens per-kernel memory but demands more passes. Empirically, chunk sizes in the range […]–[…] offer a good balance for $n$ up to […] on GPU (Ranjan et al., 2024).
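The trade-off can be quantified with a small helper. `ChunkCost` and `chunkCost` are hypothetical names; the sketch assumes the non-symmetric scheme, a chunk size that divides $n$, and the $2c+2$ doubles per hDual implied by the struct definition.

```cpp
#include <cassert>
#include <cstddef>

// Rough cost figures for one Hessian-vector product at a given chunk size.
struct ChunkCost {
    std::size_t passes;         // evaluations of f: n rows times n/c chunks
    std::size_t bytesPerHDual;  // storage of a single dual number
};

inline ChunkCost chunkCost(std::size_t n, std::size_t c) {
    assert(c > 0 && n % c == 0);
    return {n * (n / c), (2 * c + 2) * sizeof(double)};
}
```

For $n = 16$, moving from $c = 1$ to $c = 4$ cuts the passes from 256 to 64 while the size of each dual number grows from 4 to 10 doubles, so register or local-memory pressure per thread rises as the pass count falls.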

6. Applicability and Scope

CHESSFAD is applicable in scientific and engineering settings requiring high-throughput, batched computation of Hessian–vector products, such as large-scale optimization, inverse problems, and high-order sensitivity analyses in finite element modeling. The ability to exploit both row- and chunk-level parallelism enables CHESSFAD to efficiently harness modern multi-core CPUs and GPUs.

A plausible implication is that CHESSFAD's chunked forward-mode scheme could be adapted for larger $n$ by dynamically selecting chunk sizes and leveraging symmetry when available. Because it has low memory overhead and never constructs the full Hessian matrix, it is particularly suitable for sparse or structured problems where only directional second derivatives are required.
