
Forward-Mode AD: Principles & Practice

Updated 26 June 2026
  • Forward-mode AD is a computational method that computes derivatives by augmenting every operation with its tangent component using dual numbers or Taylor arithmetic.
  • It offers a constant-factor overhead for directional derivative computation and is most efficient when the number of inputs is comparable to or less than the number of outputs.
  • Modern implementations leverage techniques like SIMD vectorization, operator overloading, and source transformation to optimize performance in fields such as optimization, scientific modeling, and machine learning.

Forward-mode automatic differentiation (AD) is a rigorous computational technique for the exact propagation of derivatives (directional, partial, or higher-order) through numerical programs by augmenting each intermediate value with its corresponding tangent or derivative part. Unlike finite differences—which are inherently approximate and sensitive to discretization—forward-mode AD leverages the chain rule at the level of program composition, enabling computation of directional derivatives with only a constant-factor overhead relative to the original program. This is achieved without explicit construction of Jacobians or reliance on symbolic differentiation, making it essential for modern scientific computing, machine learning, and large-scale engineering applications.

1. Mathematical Foundations and Formalism

At the core of forward-mode AD is the abstraction of differentiability in finite-dimensional vector spaces. Let $V$ and $W$ be finite-dimensional real vector spaces and $f: V \to W$ a smooth map. At $x \in V$, the differential $df_x : T_x V \simeq V \to T_{f(x)} W \simeq W$ is the linear map associated to the first-order Taylor expansion:

$$f(x + h) = f(x) + df_x(h) + o(\|h\|).$$

Given a tangent vector $v \in V$, the pushforward $df_x(v) \in W$ is the directional derivative of $f$ at $x$ in direction $v$. The propagation of $(x, v)$ through a composite computation yields $(f(x), df_x(v))$ (Lezcano-Casado, 2022).

The chain rule, in coordinate-free form, states:

$$d(g \circ f)_x = dg_{f(x)} \circ df_x$$

for smooth $f: V \to W$ and $g: W \to U$ (Lezcano-Casado, 2022).
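As a small worked instance of the chain rule in pushforward form, consider the scalar maps $f(x) = x^2$ and $g(y) = \sin y$ (both chosen here purely for illustration). Propagating a pair of value and tangent through each step gives:

```latex
\begin{aligned}
(x,\ v) &\;\xmapsto{\ f\ }\; \bigl(x^2,\ 2x\,v\bigr) \\
        &\;\xmapsto{\ g\ }\; \bigl(\sin(x^2),\ \cos(x^2)\cdot 2x\,v\bigr),
\end{aligned}
\qquad\text{so}\qquad
d(g \circ f)_x(v) = dg_{f(x)}\bigl(df_x(v)\bigr) = 2x\cos(x^2)\,v .
```

Each step needs only the current value and the current tangent, which is why no intermediate Jacobian has to be stored.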

In a computational context, this is realized either through dual numbers or Taylor-polynomial arithmetic. For a scalar example, defining the dual number $x + \epsilon v$, where $\epsilon^2 = 0$, one computes

$$f(x + \epsilon v) = f(x) + \epsilon\, f'(x)\, v,$$

thus propagating both the value and the derivative alongside every operation. For multivariate $f: \mathbb{R}^n \to \mathbb{R}^m$, forward-mode AD computes the Jacobian-vector product $J_f(x)\, v$ through a forward trace (Hoffmann, 2014, Lezcano-Casado, 2022, Revels et al., 2016).
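The dual-number rule can be sketched in a few lines of Python. The `Dual` class, the `sin` wrapper, and the `derivative` helper below are illustrative names invented for this sketch, not the API of any library cited here; only addition, multiplication, and sine are covered.

```python
import math


class Dual:
    """A dual number val + tan*eps with eps**2 == 0 (illustrative type)."""

    def __init__(self, val, tan=0.0):
        self.val = val  # primal value
        self.tan = tan  # tangent (derivative) part

    @staticmethod
    def lift(x):
        # Plain numbers are constants, so their tangent is zero.
        return x if isinstance(x, Dual) else Dual(x, 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def sin(x):
    # Chain rule for an elementary function: d sin(u) = cos(u) du
    x = Dual.lift(x)
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)


def derivative(f, x):
    # Seed the tangent with 1 and read off the eps coefficient.
    return f(Dual(x, 1.0)).tan


def f(x):
    return x * sin(x) + 3 * x


# Analytically, f'(x) = sin(x) + x*cos(x) + 3.
print(derivative(f, 2.0))
```

The derivative is obtained in the same pass as the value, with no step size to tune, in contrast to a finite-difference quotient.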

2. Algorithmic Realization and Typing Systems

Modern implementations of forward-mode AD often proceed via operator overloading or source-to-source transformation. Each elementary operation is overloaded to propagate not only the primal value but also the tangent components.

A formal abstraction appears in the “Linear A” type system (Radul et al., 2022), where code is annotated with both primal ("non-linear") and tangent ("linear") variables, with the J-transform mapping a function $f$ to $(x, v) \mapsto (f(x), df_x(v))$, producing both primals and tangents. Linearity is strictly enforced: each tangent variable must be used exactly once (except for explicit duplication via a “dup” operation), guaranteeing algebraic linearity of the overall transformation (i.e., tangent propagation is a linear function of the seed vector) (Radul et al., 2022).
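The linearity property can be checked numerically on a toy example. The sketch below does not implement the Linear A type system; it only assumes a minimal dual-number type (`Dual`, `jvp`, and the test function `f` are hypothetical names) and verifies that the tangent output is a linear function of the seed.

```python
class Dual:
    """Minimal dual number supporting + and * between Dual values only."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)


def jvp(f, xs, vs):
    # Push the seed vs through f at the point xs; return the tangent output.
    return f(*[Dual(x, v) for x, v in zip(xs, vs)]).tan


def f(x, y):
    return x * y + x * x


point = (3.0, 2.0)
u, w = (1.0, 0.0), (0.0, 1.0)
combined = (2.0, -1.0)  # the seed 2*u - 1*w

lhs = jvp(f, point, combined)
rhs = 2.0 * jvp(f, point, u) - 1.0 * jvp(f, point, w)
print(lhs, rhs)  # → 13.0 13.0
```

The primal point enters non-linearly (through the products), while the seed enters only linearly, which is the distinction the type system tracks statically.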

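For languages with higher-order types, recursive types, or partiality, fully type-preserving macros define a dual-type mapping that sends the type of reals to pairs of reals (primal and tangent) and extends structurally to all other types. The accompanying correctness theorem states that the transformed program computes the value of the original program together with its derivative, preserving semantics even in the presence of iteration and recursion (Vákár, 2020).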

3. Computational Properties and Complexity

The critical computational property of forward-mode AD is that for $f: \mathbb{R}^n \to \mathbb{R}^m$ and a specific direction $v \in \mathbb{R}^n$, it computes $J_f(x)\, v$ in time $O(\mathrm{Cost}(f))$. To compute the full Jacobian, $n$ passes are typically required, each with a seed equal to a column of the identity matrix, resulting in overall $O(n \cdot \mathrm{Cost}(f))$ complexity (Hoffmann, 2014, Revels et al., 2016).
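The one-pass-per-input pattern can be sketched as follows. The `Dual` type and the `jacobian` helper are illustrative names for this sketch; for brevity the dual type handles only operations between dual values, so the example function avoids numeric constants.

```python
class Dual:
    """Minimal dual number supporting + and * between Dual values only."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)


def jacobian(f, x):
    """Build the m x n Jacobian of f at x with n forward passes."""
    n = len(x)
    columns = []
    for j in range(n):
        # Seed = j-th column of the identity matrix.
        inputs = [Dual(x[i], 1.0 if i == j else 0.0) for i in range(n)]
        columns.append([y.tan for y in f(inputs)])
    m = len(columns[0])
    # Each pass produced one column; transpose into row-major form.
    return [[columns[j][i] for j in range(n)] for i in range(m)]


def f(x):
    # A map from R^2 to R^3.
    return [x[0] * x[1], x[0] + x[1], x[1] * x[1]]


print(jacobian(f, [2.0, 3.0]))
# → [[3.0, 2.0], [1.0, 1.0], [0.0, 6.0]]
```

Here two passes suffice for three outputs; with the roles reversed (three inputs, two outputs) the same code would need three passes.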

Forward-mode is thus optimal when $n \lesssim m$, i.e., when the number of input variables $n$ is comparable to or less than the number of outputs $m$. In reverse mode, most efficient when $m \ll n$, one performs a single forward pass and then $m$ backward sweeps, recovering all adjoints at an asymptotically lower cost for high-dimensional gradients (Revels et al., 2016, Shaikhha et al., 2022). For higher-order derivatives, forward-mode AD may be recursively applied by nesting dual numbers (hyper-dual approach), though this quickly becomes combinatorially expensive ($O(2^k)$ for $k$-th order derivatives) (Hoffmann, 2014, Walter et al., 2010).

Comparison Table: Forward vs Reverse Mode AD

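| Mode    | Single pass cost (JVP / VJP) | Full Jacobian cost        | Best use case    |
|---------|------------------------------|---------------------------|------------------|
| Forward | O(Cost(f))                   | O(n * Cost(f))            | n ≲ m            |
| Reverse | O(Cost(f)) + O(tape)         | O(m * Cost(f)) + O(tape)  | m ≲ n, gradients |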

4. Practical Implementations, Language Integration, and Parallelization

Forward-mode AD is ubiquitous in scientific computing environments, with implementations spanning Julia (ForwardDiff.jl (Revels et al., 2016)), Rust (ad-trait (Liang et al., 22 Apr 2025)), Prolog (Schrijvers et al., 2023), C++ (CHESSFAD (Ranjan et al., 2024)), and dynamic binary instrumentation tools (Derivgrind (Aehle et al., 2022)).

Techniques for practical efficiency include:

  • Chunked vector forward-mode: processing multiple seeds in a single pass using SIMD vectorization, as in ad-trait (SIMD-accelerated tangent arrays) and ForwardDiff’s chunk mode, which reduces passes from $n$ to $\lceil n/c \rceil$ for $n$ inputs and chunk size $c$ (Liang et al., 22 Apr 2025, Revels et al., 2016, Ranjan et al., 2024).
  • Operator overloading vs. source code transformation: Operator overloading yields rapid prototyping (used in ForwardDiff, ad-trait, Geant4 EasyAD (Aehle et al., 2024)), while source transformation is preferred in array languages and for full program optimization (Shaikhha et al., 2022).
  • Parallelization: Row-wise and chunk-wise parallelism is exploited for, e.g., Hessian-vector products on GPUs in CHESSFAD (Ranjan et al., 2024).
  • Dynamic binary instrumentation (Derivgrind): Instrumentation at the VEX IR (Valgrind) level allows forward-mode AD to be transparently injected into compiled binaries, crucial when source access is impossible (Aehle et al., 2022).
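The chunking idea from the list above can be sketched in plain Python: each intermediate value carries a short array of tangents, one per seed in the current chunk, so one evaluation of the function yields several Jacobian columns. `ChunkDual` and `jacobian_chunked` are hypothetical names for this sketch, and the Python lists only model the data layout; they do not use SIMD instructions the way the cited implementations do.

```python
class ChunkDual:
    """A value carrying one tangent per seed in the current chunk."""

    def __init__(self, val, tans):
        self.val = val
        self.tans = list(tans)

    def __add__(self, other):
        return ChunkDual(self.val + other.val,
                         [a + b for a, b in zip(self.tans, other.tans)])

    def __mul__(self, other):
        # Product rule applied to every tangent slot at once.
        return ChunkDual(self.val * other.val,
                         [a * other.val + self.val * b
                          for a, b in zip(self.tans, other.tans)])


def jacobian_chunked(f, x, chunk):
    """Return (Jacobian, number of passes) using `chunk` seeds per pass."""
    n = len(x)
    columns, passes = [], 0
    for start in range(0, n, chunk):
        width = min(chunk, n - start)
        # Input i gets a 1 in slot k exactly when i is the (start + k)-th input.
        inputs = [ChunkDual(x[i], [1.0 if i == start + k else 0.0
                                   for k in range(width)])
                  for i in range(n)]
        outputs = f(inputs)
        passes += 1
        for k in range(width):
            columns.append([y.tans[k] for y in outputs])
    m = len(columns[0])
    return [[columns[j][i] for j in range(n)] for i in range(m)], passes


def f(x):
    return [x[0] * x[1] + x[2], x[1] * x[2] * x[0]]


jac, passes = jacobian_chunked(f, [1.0, 2.0, 3.0], chunk=2)
print(jac, passes)
# → [[2.0, 1.0, 1.0], [6.0, 3.0, 2.0]] 2
```

With three inputs and a chunk size of two, the primal computation runs twice instead of three times; the per-pass work grows with the chunk width, which is where vector hardware helps.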

Notable applications include optimization and scientific modeling (JuMP (Revels et al., 2016), Newton-Krylov solvers (Pasquale et al., 13 May 2026)), Bayesian neural networks (MCMC via forward-mode MALA (Cobb et al., 23 May 2025)), and high-energy physics simulation (Geant4 (Aehle et al., 2024)).

5. Forward-Mode AD Beyond First-Order: Higher-Order and Implicit Functions

Forward-mode AD extends naturally to higher derivatives using truncated Taylor polynomial algebras or hyper-dual numbers. For univariate and multivariate functions, the forward approach computes derivatives up to order $k$ with a memory footprint only about $k+1$ times that of the base program (one array per Taylor coefficient), and multiplies the matrix-multiplication cost by a factor that grows with the order $k$ (Sugimoto, 9 Feb 2026). This makes it suitable for problems requiring Hessians or Hessian-vector products, notably in array and matrix computations (CHESSFAD for GPUs (Ranjan et al., 2024), univariate Taylor propagation for QR and eigenvalue decompositions (Walter et al., 2010)).
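A minimal sketch of truncated Taylor arithmetic for a scalar function follows. It represents a value by its list of Taylor coefficients and multiplies two such lists by truncated convolution; the helper names (`taylor_var`, `taylor_mul`, `derivatives`) are invented for this illustration, and only multiplication is implemented.

```python
from math import factorial


def taylor_var(x0, order):
    # The independent variable x0 + t, truncated after t**order (order >= 1).
    return [x0, 1.0] + [0.0] * (order - 1)


def taylor_mul(a, b):
    """Product of two truncated Taylor series with equally many coefficients."""
    return [sum(a[i] * b[j - i] for i in range(j + 1)) for j in range(len(a))]


def derivatives(coeffs):
    # The j-th derivative at x0 is j! times the j-th Taylor coefficient.
    return [factorial(j) * c for j, c in enumerate(coeffs)]


x = taylor_var(2.0, order=3)
cube = taylor_mul(taylor_mul(x, x), x)  # Taylor coefficients of (2 + t)**3

print(cube)               # → [8.0, 12.0, 6.0, 1.0]
print(derivatives(cube))  # → [8.0, 12.0, 12.0, 6.0]
```

A single pass yields the value and the first three derivatives of $x^3$ at $x = 2$, and each intermediate stores order + 1 numbers.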

Implicit function differentiation, such as QR or symmetric eigendecomposition, is handled via “Hensel lifting”: advancing the Taylor coefficients of matrix factorization order by order, avoiding ill-conditioning due to branching logic or poles in the analytic structure (Walter et al., 2010). The same methodology generalizes to SVD (with regularization to handle near-degenerate singular values) and tensor network contraction trees (Sugimoto, 9 Feb 2026).
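The order-by-order idea can be illustrated on a scalar analogue rather than on a matrix factorization. The sketch below, which is an assumption-laden toy and not the algorithm of Walter et al., solves the defining equation $y(t)^2 = x(t)$ for the Taylor coefficients of $y = \sqrt{x}$ one order at a time; `taylor_sqrt` is a hypothetical helper name.

```python
import math


def taylor_sqrt(x):
    """Taylor coefficients of y(t) = sqrt(x(t)), lifted one order at a time.

    Matching the t**k coefficient of y*y = x gives
        2*y[0]*y[k] + sum_{i=1}^{k-1} y[i]*y[k-i] = x[k],
    which is linear in the single unknown y[k].
    """
    y = [math.sqrt(x[0])]  # order 0: solve the original nonlinear equation
    for k in range(1, len(x)):
        known = sum(y[i] * y[k - i] for i in range(1, k))
        y.append((x[k] - known) / (2.0 * y[0]))
    return y


# sqrt(1 + t) = 1 + t/2 - t**2/8 + t**3/16 - ...
print(taylor_sqrt([1.0, 1.0, 0.0, 0.0]))
# → [1.0, 0.5, -0.125, 0.0625]
```

Only the zeroth-order step is nonlinear; every later order solves a linear equation with the same coefficient $2 y_0$, mirroring how the matrix case reuses the zeroth-order factorization.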

6. Correctness, Limitations, and Domain of Applicability

Semantically, forward-mode AD is correct for all first-order programs over smooth primitives; the set of exceptional points (where the output of AD disagrees with the mathematical derivative) is a countable union of measure-zero sets (quasivarieties) associated with branch points or non-differentiability (Mazza et al., 2020). Correctness theorems have been established in higher-order, recursive, and partial languages via logical relations in the denotational semantics of diffeological spaces, extending to type and term recursion (Vákár, 2020). When branching depends on differentiable parameters (e.g., conditionals in PCF or Prolog), the gradient computation may be discontinuous, but the locus of such failures remains negligible in the sense of Lebesgue measure (Mazza et al., 2020, Schrijvers et al., 2023).
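A standard toy example of such an exceptional point is a program that special-cases one input value. In the sketch below (with a hypothetical minimal `Dual` type), the function equals the identity everywhere, so its derivative is 1 at every point, yet AD returns 0 at the single point where the constant branch is taken.

```python
class Dual:
    """Minimal dual number: a primal value and a tangent."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan


def f(x):
    # Mathematically f(x) = x for all x, so f'(x) = 1 everywhere.
    if x.val == 0.0:
        return Dual(0.0, 0.0)  # constant branch: the tangent is dropped
    return x


def derivative(f, x):
    return f(Dual(x, 1.0)).tan


print(derivative(f, 1.0))  # → 1.0 (agrees with the true derivative)
print(derivative(f, 0.0))  # → 0.0 (disagrees: the true derivative is 1)
```

The disagreement is confined to the single point `x == 0.0`, a set of measure zero, consistent with the almost-everywhere correctness statement above.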

Known limitations and trade-offs include:

  • Cost scaling: $n$ passes are required to compute all partials in high input dimension $n$; not optimal for $n \gg m$, i.e., many inputs and few outputs (Revels et al., 2016, Hoffmann, 2014).
  • Perturbation confusion in higher-order or nested differentiation (requires renaming/alpha-conversion) (Shaikhha et al., 2022).
  • Code expansion and compile-time overhead with deep dual-number nesting or large chunk sizes (Revels et al., 2016).
  • No “tape”; no reverse accumulation: Forward mode lacks the adjoint computation required for efficiently computing full gradients in many-to-one mappings, motivating use of mixed/hybrid modes for large-scale machine learning (Radul et al., 2022, Pasquale et al., 13 May 2026).
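Perturbation confusion, listed among the limitations above, can be reproduced with a short sketch. It assumes a hypothetical tagged dual-number type in which every call to `derivative` normally draws a fresh tag, so that nested derivatives use distinct infinitesimals; forcing both levels to share one tag reproduces the bug. The example differentiates $x \cdot \frac{d}{dy}(x + y)$ with respect to $x$ at $x = 1$, whose correct value is 1.

```python
import itertools

_fresh = itertools.count(1)  # source of unique perturbation tags (0 = plain number)


class Dual:
    def __init__(self, val, tan, tag):
        self.val, self.tan, self.tag = val, tan, tag


def _tag(x):
    return x.tag if isinstance(x, Dual) else 0


def _split(x, tag):
    # Primal and tangent of x with respect to the perturbation named `tag`.
    if isinstance(x, Dual) and x.tag == tag:
        return x.val, x.tan
    return x, 0.0  # anything else is a constant for this perturbation


def add(a, b):
    tag = max(_tag(a), _tag(b))
    if tag == 0:
        return a + b
    (av, at), (bv, bt) = _split(a, tag), _split(b, tag)
    return Dual(add(av, bv), add(at, bt), tag)


def mul(a, b):
    tag = max(_tag(a), _tag(b))
    if tag == 0:
        return a * b
    (av, at), (bv, bt) = _split(a, tag), _split(b, tag)
    return Dual(mul(av, bv), add(mul(at, bv), mul(av, bt)), tag)


def derivative(f, x, tag=None):
    # A fresh tag per call keeps nested perturbations apart.
    tag = next(_fresh) if tag is None else tag
    return _split(f(Dual(x, 1.0, tag)), tag)[1]


def nested(tag=None):
    # d/dx [ x * d/dy (x + y) ] at x = 1, with the inner derivative taken at y = 1.
    return derivative(
        lambda x: mul(x, derivative(lambda y: add(x, y), 1.0, tag)),
        1.0, tag)


correct = nested()        # distinct tags for the two levels
confused = nested(tag=1)  # both levels forced to share one tag
print(correct, confused)  # → 1.0 2.0
```

With a shared tag the inner derivative wrongly picks up the outer perturbation of `x`, returning 2 instead of 1; distinct tags play the role of the renaming mentioned in the list.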

7. Emerging Variants and Theoretical Developments

Recent work connects forward- and reverse-mode AD via a transpositional “tangent transpose” construction: forward-mode is used to linearize the computation and subsequent transposition recovers classical reverse mode, formalized by a linear type system that enforces substructural linearity (one-time usage of tangent resources) (Radul et al., 2022). Checkpointing—traditionally a reverse-mode optimization—appears naturally in this framework via choices in the unzipping of variable environments.

Randomized forward-mode gradient estimators have been introduced to reduce the computational burden by using forward-mode JVPs along randomly sampled directions, yielding unbiased stochastic gradient estimators for optimization (Shukla et al., 2023, Cobb et al., 23 May 2025). These are particularly competitive in high-dimensional or limited-memory settings, as in Langevin MCMC variants that require only directional derivatives rather than full gradients (Cobb et al., 23 May 2025).
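One common form of such an estimator samples a direction $v \sim \mathcal{N}(0, I)$ and returns $(\nabla f \cdot v)\, v$, which is unbiased because $\mathbb{E}[v v^\top] = I$. The sketch below assumes this form, which may differ in detail from the estimators in the cited papers; `Dual`, `jvp`, and `forward_gradient` are illustrative names.

```python
import random


class Dual:
    """Minimal dual number supporting + and * between Dual values only."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)


def jvp(f, x, v):
    # Directional derivative of f at x along v, in one forward pass.
    return f([Dual(xi, vi) for xi, vi in zip(x, v)]).tan


def forward_gradient(f, x, rng):
    """One sample of the estimator (grad f . v) v with v ~ N(0, I)."""
    v = [rng.gauss(0.0, 1.0) for _ in x]
    d = jvp(f, x, v)  # scalar directional derivative; no full gradient needed
    return [d * vi for vi in v], v


def f(x):
    # Sum of squares; its true gradient is 2*x.
    total = Dual(0.0)
    for xi in x:
        total = total + xi * xi
    return total


rng = random.Random(0)
x = [1.0, -2.0, 0.5]
estimate, v = forward_gradient(f, x, rng)
print(estimate)
```

Each sample costs one forward pass and stores no tape; averaging many samples reduces the variance of the estimate, at the price of more passes.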

Advances in practical AD systems have also clarified the role of loop optimization, fusion, and code motion in functional and array-programming languages. Here, aggressive optimization of dual-number programs renders forward-mode AD as efficient as reverse mode for many fused data-parallel workloads (Shaikhha et al., 2022).


In summary, forward-mode automatic differentiation is a coordinate-free and mathematically rigorous method for pushforward propagation of derivatives, realized algorithmically by dual numbers or Taylor arithmetic and operationalized via a variety of software and hardware platforms. Its precision, minimal memory overhead, and simplicity make it essential in domains ranging from nonlinear PDE solvers and optimization to probabilistic inference and high-energy physics simulation (Lezcano-Casado, 2022, Radul et al., 2022, Mazza et al., 2020, Revels et al., 2016, Aehle et al., 2022, Pasquale et al., 13 May 2026, Liang et al., 22 Apr 2025, Shaikhha et al., 2022, Aehle et al., 2024, Ranjan et al., 2024). Theoretical and practical advances continue to clarify and extend its role, especially in the context of higher-order, symbolic, or stochastic computation.
