CHESSFAD: Chunked Hessian AD
- CHESSFAD is an automatic differentiation methodology that computes Hessians and Hessian–vector products using a chunked forward-mode approach.
- It employs a dual-number design with operator overloading to enable fine-grained parallelism on both CPUs and GPUs while minimizing memory usage.
- Benchmarks on functions like Rosenbrock and Ackley show significant speedups over traditional AD libraries, especially on NVIDIA GPUs.
CHESSFAD (Chunked HESSian using Forward-mode AD) is an automatic differentiation (AD) methodology and software library for efficient, parallel computation of Hessians and Hessian–vector products, specifically targeting applications where large numbers of such products must be computed across many data points simultaneously on modern accelerators such as NVIDIA GPUs. The core design principles of CHESSFAD leverage forward-mode AD with a chunked approach to enable fine-grained parallelism at multiple levels and minimize memory usage by avoiding explicit materialization of the full Hessian. As a lightweight, header-based C++ implementation, CHESSFAD is portable across both CPUs and GPUs (Ranjan et al., 2024).
1. Mathematical Foundations and Dual-Number Design
Let , with Hessian . The principal task is to evaluate or compute products for arbitrary .
CHESSFAD employs forward-mode AD extended to second-order by using a custom data structure, Dual, which encapsulates the function value, relevant first-order derivatives, and a chunk of second-order mixed partials: with denoting a contiguous chunk of columns in row , and the chunk size. By operator overloading, arithmetic on 0Duals propagates all required higher-order derivatives. For example, for 1, 2 as above: 3 By initializing appropriate slots to Kronecker deltas, a single invocation of the dual-number-templated 4 yields multiple entries per row of 5 in a single pass.
Hessian–vector products 6 are formed incrementally: for each computed chunk 7, the partial dot product 8 is accumulated into 9, with results streamed and discarded immediately to avoid full Hessian materialization.
2. Parallelization Strategies: Row and Chunk Concurrency
CHESSFAD exposes two primary, independent levels of parallelism:
- Row-parallelism (0): Each row 1 of 2 can be computed independently, since 3 and 4 are mutually independent for 5.
- Chunk-parallelism (6): Each row 7 is divided into 8 contiguous chunks of size 9, enabling concurrent computation of these chunks.
The GPU implementation assigns each (instance, row, chunk) tuple to a GPU thread, as in the following representative CUDA code snippet:
c++
__global__ void chessVecKernel(double* x, double* v, double* out, int n, int csize) {
int eid = blockIdx.x;
int tid = threadIdx.x;
int nchunk = n / csize;
int i = tid / nchunk;
int j = tid % nchunk;
CHUNK_INIT(y, &x[eid*n], i, j, n, csize);
auto temp = f<hDual<csize>>(y);
double partial = 0;
for(int l=0; l < csize; ++l)
partial += temp.v[csize+2+l] * v[eid*n + j*csize + l];
__shared__ double sprod[MAX_N] [MAX_NCHUNKS];
sprod[i] [j] = partial;
__syncthreads();
if(j==0){
double sum=0;
for(int k=0; k<nchunk; ++k) sum += sprod[i] [k];
out[eid*n+i]=sum;
}
}
5c++
template<int csize>
struct hDual {
double v[2*csize + 2];
// operator overloads
};
Initialization macros (CHUNK_INIT) configure each 0Dual so that the required derivative slots (for a given row 1, chunk 2) are set to 3.
On CUDA, thread-local or register-resident 4Dual arrays store input variables for each thread. Intermediate results are accumulated into a shared memory array per block for efficient in-block reductions, and dot products are streamed to output buffers to minimize global memory traffic and avoid constructing the full Hessian tensor.
4. Empirical Performance and Benchmarking
Performance was evaluated against the CPU automatic differentiation library autodiff using three standard benchmark functions generalized to 5 variables: Rosenbrock, Ackley, and Fletcher–Powell. Each experiment consisted of 6 independent Hessian–vector products on CPU and 7 million on GPU. CHESSFAD delivered speedups over autodiff as follows:
| Function | CPU Speedup | A100 GPU Speedup (8) | A100 GPU Speedup (9) |
|---|---|---|---|
| Rosenbrock | ~20% | 0 | 1 |
| Ackley | ~5% | 2 | 3 |
| Fletcher–Powell | ~49% | 4 | 5 |
Kernel-only speedup drops sublinearly with respect to 6 due to quadratic scaling, yet remains significant up to 7. The time to perform 8 sequential computations is roughly equivalent to 9 in parallel on the A100 GPU. This suggests strong GPU scaling headroom before parallel efficiency saturates (Ranjan et al., 2024).
5. Computational Complexity and Chunk Size Trade-Offs
Let 0 denote the number of multiplications and 1 the number of additions in the scalar code for 2, with 3 variables and chunk size 4.
- Without symmetry (CHUNK-HESS):
- Number of function calls: 5
- Each 6Dual–multiply: 7 multiplies, 8 adds
- Each 9Dual–add: 0 adds
- Total scalar multiplies: 1
- Total scalar adds: 2
- With symmetry (SCHUNK-HESS):
- Only 3 chunks are computed
- Scalar-multiply count: 4
- The optimal chunk size for fixed 5 is 6
By streaming out each chunk's result directly into the dot product, memory bandwidth is reduced and the need to store the full Hessian is removed. The chunk size 7 introduces a trade-off: larger 8 reduces the number of AD passes at the expense of increased per-pass memory, while smaller 9 lightens per-kernel memory but demands more passes. Empirically, chunk sizes in the range 0–1 offer good balance for 2 up to 3 on GPU (Ranjan et al., 2024).
6. Applicability and Scope
CHESSFAD is applicable in scientific and engineering settings requiring high-throughput, batched computation of Hessian–vector products, such as large-scale optimization, inverse problems, and high-order sensitivity analyses in finite element modeling. The ability to exploit both row- and chunk-level parallelism enables CHESSFAD to efficiently harness modern multi-core CPUs and GPUs.
A plausible implication is that CHESSFAD's chunked forward-mode scheme could be adapted for larger 4 by dynamically selecting chunk sizes and leveraging symmetry when available. Its low memory overhead and advantage in not constructing the full Hessian matrix mark it as particularly suitable for sparse or structured problems where only directional second derivatives are required.