
Forward-Mode Automatic Differentiation

Updated 6 March 2026
  • Forward-mode automatic differentiation is an algorithm that computes derivatives by propagating dual numbers and tangent values alongside function evaluations.
  • It is implemented through techniques like operator overloading, source-code transformation, or AST-based methods to efficiently execute the chain rule.
  • The method excels in computing directional derivatives and limited-input Jacobians, making it valuable for optimization, machine learning, and scientific applications.

Forward-mode automatic differentiation (AD) is an algorithmic technique enabling the exact and efficient computation of derivatives of functions represented as computer programs. It propagates derivatives from inputs to outputs alongside the normal evaluation, exploiting algebraic structures such as dual numbers and chain rules to compute directional derivatives and, by repeated application or vectorization, full Jacobians. Forward mode is widely employed in scientific computing, optimization, machine learning, and computational physics due to its simplicity, extensibility, and favorable computational complexity for functions with limited input arity.

1. Mathematical Foundations: Dual Numbers and Tangent Propagation

At the core of forward-mode AD lies the algebra of dual numbers. A dual number is a pair $(x, \dot{x})$, often represented as $x + \varepsilon \dot{x}$, where $\varepsilon$ is a nilpotent infinitesimal ($\varepsilon^2 = 0$). The arithmetic on dual numbers is defined so that for any smooth scalar function $f$,

$$f(x + \varepsilon \dot{x}) = f(x) + \varepsilon f'(x)\dot{x}$$

For program variables or subexpressions, every value is paired with its tangent (directional derivative w.r.t. a chosen direction or coordinate). The propagation laws from this calculus are:

  • For $y = x$: $(y, \dot{y}) = (x, \dot{x})$
  • For constants: $(c, 0)$
  • For addition: $(u, \dot{u}) + (v, \dot{v}) = (u + v, \dot{u} + \dot{v})$
  • For multiplication: $(u, \dot{u}) \cdot (v, \dot{v}) = (uv, u\dot{v} + v\dot{u})$
  • For univariate functions $f$: $(f(u), f'(u)\dot{u})$

These recurrence rules are equivalent to the first-order chain rule and realize the tangent propagation transform $(x, \dot{x}) \mapsto (f(x), f'(x)\dot{x})$. This dual-number framework is formally equivalent to a structural transformation of the computation into one operating over the tangent bundle (Schrijvers et al., 2023, Berg et al., 2022, Revels et al., 2016).
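As a concrete illustration, the propagation rules above can be coded directly as functions on (primal, tangent) pairs. The following Python sketch (function names are illustrative, not from any cited library) differentiates f(x) = x·sin(x) in a single forward pass:

```python
import math

# Every value is carried as a (primal, tangent) pair.
def const(c):  return (c, 0.0)
def add(u, v): return (u[0] + v[0], u[1] + v[1])                  # sum rule
def mul(u, v): return (u[0] * v[0], u[0] * v[1] + u[1] * v[0])    # product rule
def sin(u):    return (math.sin(u[0]), math.cos(u[0]) * u[1])     # chain rule

# Differentiate f(x) = x * sin(x) at x = 2 by seeding the tangent with 1.
x = (2.0, 1.0)
y = mul(x, sin(x))   # y = (f(2), f'(2)), with f'(x) = sin(x) + x*cos(x)
```

Seeding the tangent of the input with 1 makes the propagated tangent equal to the ordinary derivative; seeding with any other value scales the result, which is exactly the directional-derivative behavior described above.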

2. Algorithmic Structure and Implementations

Forward-mode AD can be realized in three main ways:

  • Operator overloading: Extends number types (e.g., float) to dual numbers and overloads arithmetic and transcendental functions to propagate tangents alongside the primal values. This approach is common in C++, Julia, or Python AD libraries and can be efficiently implemented using just-in-time compilation and stack allocation as in ForwardDiff.jl (Revels et al., 2016).
  • Source-code transformation: Explicitly rewrites user code at the source or intermediate representation (IR) level to introduce auxiliary tangent variables and propagate them via code templates derived from the local chain rule. This enables differentiation of dynamically typed or array programs as in Tangent for Python or Clad for C++ (Merriënboer et al., 2018, Vassilev et al., 2020).
  • Symbolic or AST-based approaches: Recursively walk the program’s abstract syntax tree, associating each node with its value and derivative or gradient as in logic programming languages (e.g., Prolog (Schrijvers et al., 2023)).

All these styles realize the forward tangent propagation as a traversal of the program’s computational graph, without constructing or traversing the graph in reverse.
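A minimal operator-overloading implementation in the style of the first bullet might look as follows. This is a sketch, not the API of any cited library; `Dual` and `d_exp` are illustrative names:

```python
import math

class Dual:
    """Dual number x + eps * dx with eps^2 = 0 (illustrative sketch)."""
    def __init__(self, x, dx=0.0):
        self.x, self.dx = x, dx
    def _lift(o):
        # Promote plain numbers to constants (tangent 0).
        return o if isinstance(o, Dual) else Dual(float(o))
    def __add__(self, o):
        o = Dual._lift(o)
        return Dual(self.x + o.x, self.dx + o.dx)
    __radd__ = __add__
    def __mul__(self, o):
        o = Dual._lift(o)
        return Dual(self.x * o.x, self.x * o.dx + self.dx * o.x)
    __rmul__ = __mul__

def d_exp(u):
    # Univariate rule: (f(u), f'(u) * du) with f = exp.
    return Dual(math.exp(u.x), math.exp(u.x) * u.dx)

def f(z):
    return z * d_exp(z) + 3 * z

y = f(Dual(1.0, 1.0))   # y.x = f(1) = e + 3, y.dx = f'(1) = 2e + 3
```

Production systems extend this pattern to the full set of arithmetic and transcendental operations, and (as in ForwardDiff.jl) rely on the compiler to specialize and stack-allocate the dual type.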

3. Computational Complexity and Performance Behavior

The computational cost of forward-mode AD for a function $f : \mathbb{R}^n \to \mathbb{R}^m$ depends on the desired derivative:

  • Directional derivatives (Jacobian-vector products): For any vector $v \in \mathbb{R}^n$, the evaluation of $J_f(x)v$ (the Jacobian-vector product, or JVP) is proportional to the cost of evaluating $f$, up to a small constant factor. Only one pass over the computation is needed per direction (Hoffmann, 2014, Berg et al., 2022, Schrijvers et al., 2023).
  • Full Jacobian matrices or gradients: To compute all $n$ partials of the gradient or Jacobian, forward mode must be run once per input variable, i.e., once per column of the Jacobian. Consequently, for functions with large input dimension $n$ and small output dimension $m$ (e.g., scalar-valued loss functions in machine learning with many parameters), the total cost scales as $O(n \cdot \mathrm{cost}(f))$ (Revels et al., 2016, Shaikhha et al., 2022).
  • Comparison with reverse mode: Reverse-mode AD (backpropagation) propagates adjoints from outputs to inputs, yielding the full gradient in a single backward pass at $O(\mathrm{cost}(f))$ cost, provided the output dimension is small. Thus reverse mode is generally preferred when $n \gg m$, and forward mode when $n$ is small or only a few directional derivatives are required.
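The one-pass-per-column cost model can be made concrete: the sketch below (illustrative code, not a library API) recovers a full Jacobian by seeding one unit tangent per input and reading off the tangents of the outputs.

```python
def f(x, y):
    # f : R^2 -> R^2 with f(x, y) = (x*y, x + y^2); each variable is a
    # (primal, tangent) pair propagated by the dual-number product/sum rules.
    add = lambda u, v: (u[0] + v[0], u[1] + v[1])
    mul = lambda u, v: (u[0] * v[0], u[0] * v[1] + u[1] * v[0])
    return [mul(x, y), add(x, mul(y, y))]

def jacobian(fn, point):
    n = len(point)
    cols = []
    for i in range(n):  # one full forward pass per input variable
        seeded = [(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        outs = fn(*seeded)
        cols.append([o[1] for o in outs])  # tangents = i-th Jacobian column
    # Transpose the collected columns into rows.
    return [list(row) for row in zip(*cols)]

J = jacobian(f, [2.0, 3.0])  # analytic: [[y, x], [1, 2y]] = [[3, 2], [1, 6]]
```

The loop makes the $O(n \cdot \mathrm{cost}(f))$ scaling explicit: `fn` is evaluated once per input dimension.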

Empirical Results

Optimized forward-mode AD systems, such as ForwardDiff.jl, can outperform reverse-mode systems for moderate input sizes due to lower memory requirements and efficient stack allocation. For example, in high-dimensional gradient computation (Ackley function, input size 12,000), ForwardDiff outperforms a C++ implementation in some regimes (Revels et al., 2016). Loop fusion and global code motion in array-processing languages can bring forward-mode efficiency to parity with reverse mode even for vector or matrix code (Shaikhha et al., 2022).

4. Extensions: Vectorization, Higher Derivatives, Stochastic Gradients

  • Vector-Forward Mode: By generalizing the dual number from $x + \varepsilon \dot{x}$ to $x + \sum_{i=1}^k \varepsilon_i \dot{x}_i$, one can propagate multiple derivatives simultaneously (Revels et al., 2016). Chunked or block-wise strategies balance memory and computational efficiency.
  • Higher-Order Derivatives: Nesting duals (e.g., hyper-dual numbers) enables simultaneous propagation of first and second derivatives. Hyper-dual numbers of the form $x + v_1\varepsilon_1 + v_2\varepsilon_2 + v_{12}\varepsilon_1\varepsilon_2$ realize quadratic forms and mixed Hessians in one pass (Cobb et al., 2024, Hoffmann, 2014).
  • Randomized/Monte Carlo Forward Gradients: Recent methods replace the explicit computation of the full gradient with unbiased stochastic estimators, such as computing a directional derivative along a random direction $u$ and forming the estimator $(\nabla f(x) \cdot u)\,u$, which is an unbiased estimate of the true gradient for suitably distributed $u$ (e.g., standard normal). This approach enables “forward gradient descent” and achieves practical speedups in large-scale settings by reducing memory and computation compared to backpropagation (Shukla et al., 2023, Baydin et al., 2022).
  • Structured Arrays, Higher-Order Functions: Forward-mode AD can be structured to handle higher-order functions (lambdas, folds, builds) and array combinators in functional array-processing languages, provided the dual-number propagation is pushed through all combinators and aggressive global optimization collapses naïve re-evaluation into fused loops (Shaikhha et al., 2022).
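The hyper-dual construction can be sketched with addition and multiplication only (an illustrative class; real implementations also overload transcendental functions):

```python
class HyperDual:
    """x + v1*e1 + v2*e2 + v12*e1*e2, with e1^2 = e2^2 = 0 (sketch)."""
    def __init__(self, x, v1=0.0, v2=0.0, v12=0.0):
        self.x, self.v1, self.v2, self.v12 = x, v1, v2, v12
    def __add__(self, o):
        return HyperDual(self.x + o.x, self.v1 + o.v1,
                         self.v2 + o.v2, self.v12 + o.v12)
    def __mul__(self, o):
        # Expand (x + v1 e1 + v2 e2 + v12 e1 e2)(y + w1 e1 + ...) and
        # drop every term containing e1^2 or e2^2.
        return HyperDual(self.x * o.x,
                         self.x * o.v1 + self.v1 * o.x,
                         self.x * o.v2 + self.v2 * o.x,
                         self.x * o.v12 + self.v1 * o.v2
                         + self.v2 * o.v1 + self.v12 * o.x)

# f(x) = x^3 at x = 2: seed both first-order parts with 1 so that
# the e1 coefficient carries f' and the e1*e2 coefficient carries f''.
x = HyperDual(2.0, 1.0, 1.0, 0.0)
y = x * x * x   # y.x = 8, y.v1 = f'(2) = 12, y.v12 = f''(2) = 12
```

A single evaluation thus yields the value, the first derivative, and the second derivative, which is the mechanism behind the one-pass Hessian computations cited above.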

5. Correctness, Semantics, and Formal Verification

Semantically, forward-mode AD is equivalent to computing the pushforward (differential) or Taylor expansion of the target function. In coordinate-free terms, the forward transformation realizes the tangent map $T(f)(x, v) = (f(x), Df(x)[v])$ for functions $f : V \to W$ between vector spaces, where $Df(x)$ is the Jacobian at $x$ acting on the tangent vector $v$ (Lezcano-Casado, 2022).

  • Algebraic and Logical-Relations Models: Forward mode is derived from generic algebraic constructions (Nagata idealization over semirings and modules, Kronecker delta functions, etc.), and correctness follows from induction on these algebraic structures (Berg et al., 2022).
  • Denotational correctness: Semantic logical-relations arguments, constructed over diffeological spaces or domains, provide mechanically verified correctness in the presence of partiality, higher-order functions, and general recursion (Vákár, 2020, Kelly et al., 2016).
  • Type-discipline: Substructural linear type systems guarantee that tangent-propagation is algebraically linear in the tangent input, a property exploited in separating forward and reverse phases (“You Only Linearize Once”) (Radul et al., 2022).
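The linearity property that such type systems enforce can be checked numerically: a forward-mode JVP is linear in its tangent argument. A small sketch, with a hand-coded JVP for one fixed function (all names illustrative):

```python
import math

def jvp(x, y, dx, dy):
    """Forward-mode JVP of f(x, y) = sin(x)*y + x*x at (x, y) along (dx, dy)."""
    mul = lambda u, v: (u[0] * v[0], u[0] * v[1] + u[1] * v[0])
    add = lambda u, v: (u[0] + v[0], u[1] + v[1])
    sx = (math.sin(x), math.cos(x) * dx)  # tangent rule for sin
    out = add(mul(sx, (y, dy)), mul((x, dx), (x, dx)))
    return out[1]

x0, y0 = 0.5, 2.0
v, w = (1.0, 0.0), (0.3, -1.2)
a, b = 2.0, -0.7
# Df(x)[a*v + b*w] should equal a*Df(x)[v] + b*Df(x)[w] up to rounding.
lhs = jvp(x0, y0, a * v[0] + b * w[0], a * v[1] + b * w[1])
rhs = a * jvp(x0, y0, *v) + b * jvp(x0, y0, *w)
```

Every propagation rule is linear in the tangent components, so linearity of the whole transform follows by composition; the check above merely witnesses it at one point.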

6. Applications, Domain-Specific Optimizations, and Mixed-Mode Schemes

Forward-mode AD is employed in diverse application domains:

  • Optimization and Machine Learning: In problems with few parameters or where directional/Hessian-vector products are needed (e.g., line search, hyperplane search, second-order methods without backpropagation), forward-mode AD provides efficient primitives (Cobb et al., 2024).
  • Compiled Programs and Legacy Code: Forward-mode AD can be retrofitted to compiled binaries (e.g., C, Fortran, Python) via binary translation frameworks such as Derivgrind, which instruments machine code with shadow variable propagation, enabling gradient computation even when source is unavailable (Aehle et al., 2022).
  • Broadcast Kernels and Mixed-Mode GPU Schemes: In large-scale, elementwise or broadcasted operations, forward-mode AD fully exploits the inherently sparse structure (block-diagonal Jacobians) by fusing primal and tangent computations in a single GPU kernel, allowing arbitrary data-dependent control flow and outperforming reverse mode on such subgraphs (Revels et al., 2018).
  • Tensor Renormalization Group (TRG): In statistical physics, forward-mode AD enables efficient propagation of all derivatives up to order $k$ during coarse-graining, with computational scaling of $(k+1)(k+2)/2\times$ the original cost and $k\times$ the memory, yielding superior accuracy and smooth interpolation with impurity methods (Sugimoto, 9 Feb 2026).

7. Limitations and Trade-offs

  • Scalability in input dimension: Forward mode’s principal limitation is the linear scaling in the number of input variables when a full gradient is needed. In high-dimensional settings with scalar outputs, reverse mode is generally preferred (Revels et al., 2016, Berg et al., 2022).
  • Memory and Performance: Forward mode excels in settings with limited input dimension, dense outputs, and when sparse or structured Jacobian-vector products are required. Its streaming nature avoids the memory overhead of storing full computation traces needed in reverse mode (Aehle et al., 2022).
  • Composability: Forward and reverse mode can be composed (mixed-mode AD) for higher-order derivatives (e.g., Hessian-vector products: apply forward mode to a reverse-mode gradient function), enabling efficient second-order optimization and curvature estimation (Merriënboer et al., 2018, Shaikhha et al., 2022).
  • Tooling and Language Support: Modern AD systems provide mature support for forward-mode via operator overloading, source transformation, and runtime IR instrumentation, across compiled, interpreted, and dynamic array languages (Revels et al., 2016, Vassilev et al., 2020, Merriënboer et al., 2018, Aehle et al., 2022).
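As a sketch of how nesting modes yields higher-order information, a Hessian-vector product can also be computed with forward-over-forward nesting: dual numbers whose components are themselves duals. The code below is illustrative only (the cited systems typically use forward-over-reverse for this):

```python
class D:
    """Dual number whose components may themselves be duals (enables nesting)."""
    def __init__(self, x, dx):
        self.x, self.dx = x, dx
    def __add__(self, o): return D(self.x + o.x, self.dx + o.dx)
    def __mul__(self, o): return D(self.x * o.x, self.x * o.dx + self.dx * o.x)

def f(x, y):
    # f(x, y) = x^2 * y + y^2; gradient (2xy, x^2 + 2y),
    # Hessian [[2y, 2x], [2x, 2]].
    return x * x * y + y * y

# Hessian-vector product H(x0) @ u: inner duals carry the unit seed e_i
# (selecting the i-th gradient entry), outer duals carry the direction u.
x0, u = (1.0, 2.0), (1.0, -1.0)

def hvp_component(i):
    x = D(D(x0[0], 1.0 if i == 0 else 0.0), D(u[0], 0.0))
    y = D(D(x0[1], 1.0 if i == 1 else 0.0), D(u[1], 0.0))
    return f(x, y).dx.dx   # d/d(eps_outer) of df/dx_i = (H u)_i

Hu = [hvp_component(0), hvp_component(1)]
# Analytic check at (1, 2) with u = (1, -1): H = [[4, 2], [2, 2]], Hu = (2, 0).
```

Each component still costs one forward pass, so a full HVP in this style costs $n$ passes; forward-over-reverse reduces that to a constant number of passes when the function is scalar-valued.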
