Forward-Mode AD: Principles & Practice

Updated 26 June 2026

Forward-mode AD is a computational method that computes derivatives by augmenting every operation with its tangent component using dual numbers or Taylor arithmetic.
It offers a constant-factor overhead for directional derivative computation and is most efficient when the number of inputs is comparable to or less than the number of outputs.
Modern implementations leverage techniques like SIMD vectorization, operator overloading, and source transformation to optimize performance in fields such as optimization, scientific modeling, and machine learning.

Forward-mode automatic differentiation (AD) is a computational technique that propagates derivatives (directional, partial, or higher-order) through numerical programs, exactly up to floating-point rounding, by augmenting each intermediate value with its corresponding tangent or derivative part. Unlike finite differences, which are inherently approximate and sensitive to the choice of step size, forward-mode AD applies the chain rule at the level of program composition, so a directional derivative costs only a constant factor more than the original program. No explicit Jacobian is constructed and no symbolic differentiation is required, which makes the technique widely used in scientific computing, machine learning, and large-scale engineering applications.

1. Mathematical Foundations and Formalism

At the core of forward-mode AD is the abstraction of differentiability in finite-dimensional vector spaces. Let $V$ and $W$ be finite-dimensional real vector spaces and $f: V \to W$ a smooth map. At $x \in V$, the differential $df_x : T_x V \simeq V \to T_{f(x)} W \simeq W$ is the linear map associated to the first-order Taylor expansion:

$f(x + h) = f(x) + df_x(h) + o(\|h\|).$

Given a tangent vector $v \in V$, the pushforward $df_x(v)\in W$ is the directional derivative of $f$ at $x$ in direction $v$. The propagation of $v$ through a composite computation yields $df_x(v)$ (Lezcano-Casado, 2022).

The chain rule, in coordinate-free form, states:

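$d(g \circ f)_x = dg_{f(x)} \circ df_x$

for smooth $f: V \to W$ and $g: W \to Z$, where $Z$ is a third finite-dimensional real vector space (Lezcano-Casado, 2022).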
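As a minimal sketch of the chain rule above, each map can be paired with its pushforward, so that composing the pairs composes the differentials. The functions `f_jvp`, `g_jvp`, and the helper `compose_jvp` are illustrative names chosen here, not taken from the cited work, and the derivatives of the two example maps are written by hand.

```python
import math

# Each "jvp" function maps (x, v) to (f(x), df_x(v)).
# f_jvp, g_jvp and compose_jvp are hypothetical names used for illustration.

def f_jvp(x, v):
    # f(x) = x^2, so df_x(v) = 2 x v
    return x * x, 2.0 * x * v

def g_jvp(y, w):
    # g(y) = sin(y), so dg_y(w) = cos(y) w
    return math.sin(y), math.cos(y) * w

def compose_jvp(outer, inner):
    # d(g o f)_x(v) = dg_{f(x)}(df_x(v)): feed the inner pair into the outer map.
    def composed(x, v):
        y, w = inner(x, v)
        return outer(y, w)
    return composed

h_jvp = compose_jvp(g_jvp, f_jvp)            # h(x) = sin(x^2)
value, tangent = h_jvp(1.5, 1.0)
expected = math.cos(1.5 ** 2) * 2.0 * 1.5    # analytic h'(1.5)
print(value, tangent, expected)
```

The composed function never forms a Jacobian; it only passes the pair (value, tangent) from one stage to the next.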

In a computational context, this is realized either through dual numbers or Taylor-polynomial arithmetic. For a scalar example, defining the dual number $x + \varepsilon x'$, where $\varepsilon^2 = 0$, one computes

$f(x + \varepsilon x') = f(x) + \varepsilon f'(x)\, x'$

thus propagating both the value and the derivative alongside every operation. For multivariate $f: \mathbb{R}^n \to \mathbb{R}^m$, forward-mode AD computes the Jacobian-vector product $J_f(x)\, v$ through a forward trace (Hoffmann, 2014, Lezcano-Casado, 2022, Revels et al., 2016).
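The dual-number rule can be sketched with operator overloading in a few lines. This is a minimal illustration supporting only addition, multiplication, and sine; the class name `Dual`, its fields `val` and `tan`, and the helper `derivative` are assumptions made for this example and do not mirror any particular library.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """Represents val + eps * tan with eps**2 == 0."""
    val: float
    tan: float = 0.0

    @staticmethod
    def lift(x):
        # Constants get a zero tangent.
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule: (a b)' = a b' + a' b
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def sin(x):
    x = Dual.lift(x)
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

def derivative(f, x):
    # Seed the tangent with 1 to obtain f'(x).
    return f(Dual(x, 1.0)).tan

def f(x):
    return x * sin(x) + 3 * x

d = derivative(f, 2.0)
print(d)  # compare with sin(2) + 2 cos(2) + 3
```

Seeding the tangent with a value other than 1 scales the result, which is the scalar case of a Jacobian-vector product.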

2. Algorithmic Realization and Typing Systems

Modern implementations of forward-mode AD often proceed via operator overloading or source-to-source transformation. Each elementary operation is overloaded to propagate not only the primal value but also the tangent components.

A formal abstraction appears in the “Linear A” type system (Radul et al., 2022), where code is annotated with both primal ("non-linear") and tangent ("linear") variables. The J-transform maps a program computing $f: V \to W$ to one computing $(x, v) \mapsto (f(x), df_x(v))$, producing both primals and tangents. Linearity is strictly enforced: each tangent variable must be used exactly once (except for explicit duplication via a “dup” operation), guaranteeing algebraic linearity of the overall transformation (i.e., tangent propagation is a linear function of the seed vector) (Radul et al., 2022).

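For languages with higher-order types, recursive types, or partiality, fully type-preserving macros define the dual-type mapping

$\mathcal{D}(\mathbb{R}) = \mathbb{R} \times \mathbb{R}, \qquad \mathcal{D}(\tau \to \sigma) = \mathcal{D}(\tau) \to \mathcal{D}(\sigma)$

with the correctness theorem $\llbracket \mathcal{D}(t) \rrbracket (x, v) = (f(x), df_x(v))$ for a first-order program $t$ denoting $f$, preserving semantics even in the presence of iteration and recursion (Vákár, 2020).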

3. Computational Properties and Complexity

The critical computational property of forward-mode AD is that for $f: \mathbb{R}^n \to \mathbb{R}^m$ and a specific direction $v \in \mathbb{R}^n$, it computes $J_f(x)\, v$ in time $O(\mathrm{Cost}(f))$. To compute the full Jacobian, $n$ passes are typically required, each with a seed equal to a column of the identity matrix, resulting in overall $O(n \cdot \mathrm{Cost}(f))$ complexity (Hoffmann, 2014, Revels et al., 2016).
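The pass count can be made concrete with a small sketch that assembles a Jacobian column by column from one-hot seeds. The example map and its JVP `f_jvp` are written by hand for brevity, and `jacobian_by_columns` is a hypothetical helper name; in practice the JVP would come from an AD tool.

```python
def f_jvp(x, v):
    # f(x0, x1, x2) = (x0*x1 + x2, 2*x1*x2), with its hand-written JVP.
    x0, x1, x2 = x
    v0, v1, v2 = v
    value = [x0 * x1 + x2, 2.0 * x1 * x2]
    tangent = [v0 * x1 + x0 * v1 + v2, 2.0 * (v1 * x2 + x1 * v2)]
    return value, tangent

def jacobian_by_columns(jvp, x):
    n = len(x)
    columns = []
    for j in range(n):
        # Seed j is the j-th column of the identity matrix: one pass per input.
        seed = [1.0 if i == j else 0.0 for i in range(n)]
        _, column = jvp(x, seed)
        columns.append(column)
    m = len(columns[0])
    # Transpose the list of columns into row-major form (m rows, n columns).
    return [[columns[j][i] for j in range(n)] for i in range(m)]

J = jacobian_by_columns(f_jvp, [1.0, 2.0, 3.0])
print(J)  # → [[2.0, 1.0, 1.0], [0.0, 6.0, 4.0]]
```

Each call to `jvp` costs a constant multiple of one evaluation of the function, so the loop over `n` seeds is where the factor of $n$ comes from.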

Forward-mode is thus preferable when $n \lesssim m$, i.e., when the number of input variables is comparable to or less than the number of outputs. In reverse mode, most efficient when $m \ll n$, one performs a single forward pass and then $m$ backward sweeps, recovering all adjoints at an asymptotically lower cost for high-dimensional gradients (Revels et al., 2016, Shaikhha et al., 2022). For higher-order derivatives, forward-mode AD may be recursively applied by nesting dual numbers (hyper-dual approach), though this quickly becomes combinatorially expensive ($O(2^k)$ for $k$-th order derivatives) (Hoffmann, 2014, Walter et al., 2010).
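The nesting of dual numbers mentioned above can be sketched as follows. A dual number whose components are themselves dual numbers carries four scalars, and each further level doubles that count, which is the source of the exponential growth. The class `Dual` and the helper `second_derivative` are illustrative names, and only addition and multiplication are supported in this sketch.

```python
class Dual:
    """a + eps*b with eps**2 == 0; a and b may themselves be Dual."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__

    def __mul__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def second_derivative(f, x0):
    inner = Dual(x0, 1.0)                  # perturbation of the first level
    outer = Dual(inner, Dual(1.0, 0.0))    # independent second-level perturbation
    # The tangent of the tangent holds f''(x0).
    return f(outer).tan.tan

def f(x):
    return x * x * x + 2.0 * x * x         # f''(x) = 6x + 4

d2 = second_derivative(f, 2.0)
print(d2)  # → 16.0
```

This sketch keeps the two perturbation levels apart by construction; a general implementation has to tag them explicitly to avoid mixing levels.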

Comparison Table: Forward vs Reverse Mode AD

Mode	Single JVP/VJP Cost	Full Jacobian Cost	Best Use Case
Forward	O(Cost(f))	O(n * Cost(f))	n ≲ m
Reverse	O(Cost(f)) + O(tape)	O(m * Cost(f)) + O(tape)	m ≲ n, gradients

4. Practical Implementations, Language Integration, and Parallelization

Forward-mode AD is available across scientific computing environments, including Julia (ForwardDiff.jl (Revels et al., 2016)), Rust (ad-trait (Liang et al., 22 Apr 2025)), Prolog (Schrijvers et al., 2023), C++ (CHESSFAD (Ranjan et al., 2024)), and dynamic binary instrumentation tools (Derivgrind (Aehle et al., 2022)).

Techniques for practical efficiency include:

Chunked vector forward-mode: processing multiple seeds in a single pass using SIMD vectorization, as in ad-trait (SIMD-accelerated tangent arrays) and ForwardDiff’s chunk mode, which reduces passes from $n$ to $\lceil n/c \rceil$ for chunk size $c$ (Liang et al., 22 Apr 2025, Revels et al., 2016, Ranjan et al., 2024).
Operator overloading vs. source code transformation: Operator overloading yields rapid prototyping (used in ForwardDiff, ad-trait, Geant4 EasyAD (Aehle et al., 2024)), while source transformation is preferred in array languages and for full program optimization (Shaikhha et al., 2022).
Parallelization: Row-wise and chunk-wise parallelism is exploited for, e.g., Hessian-vector products on GPUs in CHESSFAD (Ranjan et al., 2024).
Dynamic binary instrumentation (Derivgrind): Instrumentation at the VEX IR (Valgrind) level allows forward-mode AD to be transparently injected into compiled binaries, crucial when source access is impossible (Aehle et al., 2022).
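The chunked vector mode from the list above can be sketched in plain Python by letting each value carry several tangent components at once. This is an illustration of the pass-count reduction only: it uses ordinary lists rather than SIMD registers, and `ChunkDual`, `jacobian_chunked`, and the `chunk` parameter are hypothetical names that do not reproduce the ForwardDiff.jl or ad-trait interfaces.

```python
class ChunkDual:
    """Primal value plus a chunk of tangent components, one per seed."""
    def __init__(self, val, tans):
        self.val, self.tans = val, list(tans)

    def __add__(self, o):
        return ChunkDual(self.val + o.val,
                         [a + b for a, b in zip(self.tans, o.tans)])

    def __mul__(self, o):
        # Product rule applied to every tangent lane.
        return ChunkDual(self.val * o.val,
                         [self.val * b + a * o.val
                          for a, b in zip(self.tans, o.tans)])

def jacobian_chunked(f, x, chunk):
    n = len(x)
    columns = [None] * n
    passes = 0
    for start in range(0, n, chunk):
        idx = range(start, min(start + chunk, n))
        # Input i carries a one-hot lane for each seed index in this chunk.
        args = [ChunkDual(xi, [1.0 if i == j else 0.0 for j in idx])
                for i, xi in enumerate(x)]
        outs = f(args)
        passes += 1
        for lane, j in enumerate(idx):
            columns[j] = [o.tans[lane] for o in outs]
    m = len(columns[0])
    J = [[columns[j][i] for j in range(n)] for i in range(m)]
    return J, passes

def f(x):
    x0, x1, x2 = x
    return [x0 * x1 + x2, x1 * x2]

J, passes = jacobian_chunked(f, [1.0, 2.0, 3.0], chunk=2)
print(J, passes)  # → [[2.0, 1.0, 1.0], [0.0, 3.0, 2.0]] 2
```

With three inputs and a chunk size of two, the function is evaluated twice instead of three times; the primal work is shared across all lanes of a chunk.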

Notable applications include optimization and scientific modeling (JuMP (Revels et al., 2016), Newton-Krylov solvers (Pasquale et al., 13 May 2026)), Bayesian neural networks (MCMC via forward-mode MALA (Cobb et al., 23 May 2025)), and high-energy physics simulation (Geant4 (Aehle et al., 2024)).

5. Forward-Mode AD Beyond First-Order: Higher-Order and Implicit Functions

Forward-mode AD extends naturally to higher derivatives using truncated Taylor polynomial algebras or hyper-dual numbers. For univariate and multivariate functions, the forward approach computes derivatives up to order $k$ with a memory footprint only $k+1$ times that of the base program, and multiplies the matrix-multiplication cost by a factor of $O(k^2)$ at order $k$ (Sugimoto, 9 Feb 2026). This makes it suitable for problems requiring Hessians or Hessian-vector products, notably in array and matrix computations (CHESSFAD for GPUs (Ranjan et al., 2024), univariate Taylor propagation for QR and eigenvalue decompositions (Walter et al., 2010)).
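A minimal sketch of truncated Taylor arithmetic for a scalar input is given below. Each value is a list of Taylor coefficients, and multiplication is a truncated Cauchy product, which for $k+1$ stored coefficients uses $(k+1)(k+2)/2$ coefficient products. The helper names `taylor_mul`, `taylor_add`, and `derivatives` are assumptions for this example, and `order` is assumed to be at least 1.

```python
from math import factorial

def taylor_mul(a, b):
    """Truncated Cauchy product of two equal-length coefficient lists."""
    return [sum(a[i] * b[j - i] for i in range(j + 1))
            for j in range(len(a))]

def taylor_add(a, b):
    return [p + q for p, q in zip(a, b)]

def derivatives(f, x0, order):
    # Seed x(t) = x0 + t, truncated after t**order (order >= 1 assumed).
    x = [x0, 1.0] + [0.0] * (order - 1)
    coeffs = f(x)
    # Coefficient j equals f^(j)(x0) / j!, so rescale by j!.
    return [factorial(j) * c for j, c in enumerate(coeffs)]

def f(x):
    # f(x) = x^3 + x^2
    x2 = taylor_mul(x, x)
    return taylor_add(taylor_mul(x2, x), x2)

ds = derivatives(f, 2.0, order=3)
print(ds)  # → [12.0, 16.0, 14.0, 6.0]
```

A single pass yields the value and all derivatives up to the chosen order, in contrast to nesting first-order dual numbers.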

Implicit function differentiation, such as QR or symmetric eigendecomposition, is handled via “Hensel lifting”: advancing the Taylor coefficients of matrix factorization order by order, avoiding ill-conditioning due to branching logic or poles in the analytic structure (Walter et al., 2010). The same methodology generalizes to SVD (with regularization to handle near-degenerate singular values) and tensor network contraction trees (Sugimoto, 9 Feb 2026).

6. Correctness, Limitations, and Domain of Applicability

Semantically, forward-mode AD is correct for all first-order programs over smooth primitives; the set of exceptional points (where the output of AD disagrees with the mathematical derivative) is a countable union of measure-zero sets (quasivarieties) associated with branch points or non-differentiability (Mazza et al., 2020). Correctness theorems have been established in higher-order, recursive, and partial languages via logical relations in the denotational semantics of diffeological spaces, extending to type and term recursion (Vákár, 2020). When branching depends on differentiable parameters (e.g., conditionals in PCF or Prolog), the gradient computation may be discontinuous, but the locus of such failures remains negligible in the sense of Lebesgue measure (Mazza et al., 2020, Schrijvers et al., 2023).
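A standard illustration of such an exceptional point is a program that special-cases one input value. The sketch below writes the tangent propagation by hand, mimicking what a dual-number implementation would do on each branch; the function name `f_jvp` is hypothetical.

```python
def f_jvp(x, v):
    # Program under differentiation:
    #     if x == 0: return 0
    #     else:      return x
    # Mathematically this is f(x) = x, whose derivative is 1 everywhere.
    if x == 0.0:
        # AD differentiates the branch taken: the constant 0 has tangent 0.
        return 0.0, 0.0
    return x, v

at_zero = f_jvp(0.0, 1.0)      # AD reports derivative 0, the true value is 1
elsewhere = f_jvp(0.5, 1.0)    # AD reports derivative 1, which is correct
print(at_zero, elsewhere)  # → (0.0, 0.0) (0.5, 1.0)
```

The disagreement occurs only at the single point $x = 0$, a set of Lebesgue measure zero, which matches the statement above.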

Known limitations and trade-offs include:

Cost scaling: $n$ passes required to compute all partials in high input dimension; not optimal for $n \gg m$ (Revels et al., 2016, Hoffmann, 2014).
Perturbation confusion in higher-order or nested differentiation (requires renaming/alpha-conversion) (Shaikhha et al., 2022).
Code expansion and compile-time overhead with deep dual-number nesting or large chunk sizes (Revels et al., 2016).
No “tape”; no reverse accumulation: Forward mode lacks the adjoint computation required for efficiently computing full gradients in many-to-one mappings, motivating use of mixed/hybrid modes for large-scale machine learning (Radul et al., 2022, Pasquale et al., 13 May 2026).

7. Emerging Variants and Theoretical Developments

Recent work connects forward- and reverse-mode AD via a transpositional “tangent transpose” construction: forward-mode is used to linearize the computation and subsequent transposition recovers classical reverse mode, formalized by a linear type system that enforces substructural linearity (one-time usage of tangent resources) (Radul et al., 2022). Checkpointing—traditionally a reverse-mode optimization—appears naturally in this framework via choices in the unzipping of variable environments.

Randomized forward-mode gradient estimators have been introduced to reduce the computational burden by using forward-mode JVPs along randomly sampled directions, yielding unbiased stochastic gradient estimators for optimization (Shukla et al., 2023, Cobb et al., 23 May 2025). These are particularly competitive in high-dimensional or limited-memory settings, as in Langevin MCMC variants that require only directional derivatives rather than full gradients (Cobb et al., 23 May 2025).
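One common form of such an estimator draws a direction $v$ from a standard normal distribution and returns $(\nabla f(x) \cdot v)\, v$, whose expectation is $\nabla f(x)$. The sketch below assumes this form; it is not necessarily the exact estimator of the cited papers. The JVP of the example objective is written by hand, and `forward_gradient` is a hypothetical helper name.

```python
import random

def f_jvp(x, v):
    # f(x) = sum_i x_i^2 with hand-written JVP sum_i 2 x_i v_i.
    value = sum(xi * xi for xi in x)
    tangent = sum(2.0 * xi * vi for xi, vi in zip(x, v))
    return value, tangent

def forward_gradient(jvp, x, rng):
    # One forward pass along a random direction v ~ N(0, I).
    v = [rng.gauss(0.0, 1.0) for _ in x]
    _, directional = jvp(x, v)
    # (grad f . v) v is an unbiased estimate of grad f.
    return [directional * vi for vi in v]

rng = random.Random(0)          # fixed seed so the run is reproducible
x = [1.0, -2.0, 0.5]
samples = 20000
mean = [0.0] * len(x)
for _ in range(samples):
    g = forward_gradient(f_jvp, x, rng)
    mean = [m + gi / samples for m, gi in zip(mean, g)]

print(mean)  # should be close to the exact gradient [2.0, -4.0, 1.0]
```

Each sample needs one JVP and no stored intermediate values, at the price of variance that grows with the input dimension.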

Advances in practical AD systems have also clarified the role of loop optimization, fusion, and code motion in functional and array-programming languages. Here, aggressive optimization of dual-number programs renders forward-mode AD as efficient as reverse mode for many fused data-parallel workloads (Shaikhha et al., 2022).

In summary, forward-mode automatic differentiation is a coordinate-free and mathematically rigorous method for pushforward propagation of derivatives, realized algorithmically by dual numbers or Taylor arithmetic and operationalized via a variety of software and hardware platforms. Its precision, minimal memory overhead, and simplicity make it essential in domains ranging from nonlinear PDE solvers and optimization to probabilistic inference and high-energy physics simulation (Lezcano-Casado, 2022, Radul et al., 2022, Mazza et al., 2020, Revels et al., 2016, Aehle et al., 2022, Pasquale et al., 13 May 2026, Liang et al., 22 Apr 2025, Shaikhha et al., 2022, Aehle et al., 2024, Ranjan et al., 2024). Theoretical and practical advances continue to clarify and extend its role, especially in the context of higher-order, symbolic, or stochastic computation.