Vectorized Derivation of Backpropagation
- Vectorized backpropagation consolidates index-heavy recursive differentiation into concise matrix and tensor operations, enabling efficient gradient computation.
- It details the methodology for propagating error signals and computing parameter gradients in diverse architectures such as feedforward networks, CNNs, and transformers.
- The vectorized derivation leverages the chain rule, Jacobian matrices, and operator calculus, enhancing computational performance and extending to higher-order derivatives.
Backpropagation is the canonical algorithm for gradient-based learning in deep neural networks, providing efficient evaluation of derivatives of a scalar objective with respect to high-dimensional parameter tensors. The vectorized derivation of backpropagation—that is, the translation of index-heavy recursive differentiation into matrix- and tensor-level notation—has significantly extended the practical and analytical scope of deep learning. This article presents a comprehensive, methodology-focused account of the vectorized derivation, with coverage across standard feedforward, convolutional, transformer, and quadratic architectures, embedding the algorithmic principles within the context of chain rule calculus, Jacobian and differential operators, computational graph organization, and parameter-update rules as they manifest in widely used frameworks.
1. Chain Rule and Vectorized Differentials
The foundation of backpropagation is the application of the chain rule to compositions of vector-valued functions, with each neural network layer treated as a differentiable map with learnable parameters. In vectorized notation, for an $L$-layer feedforward network with input $a^{(0)} = x$ and activations $a^{(l)}$, the forward pass proceeds as $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, $a^{(l)} = \sigma(z^{(l)})$ (Cheng, 2021; Damadi et al., 2023). The chain rule is naturally expressed using Jacobian matrices or vectorized differentials: for a composition $f = f_L \circ \cdots \circ f_1$, the first-order derivative is given by $Df(x) = Df_L(x_{L-1})\,Df_{L-1}(x_{L-2})\cdots Df_1(x_0)$ with $x_l = f_l(x_{l-1})$ and $x_0 = x$, and higher-order derivatives are handled via symmetric tensor products and identification theorems (see Theorem 1 in Chacón et al., 2020).
For the composite function $a^{(L)} = f_L\big(f_{L-1}(\cdots f_1(x;\theta_1)\cdots;\theta_{L-1});\theta_L\big)$, the gradient of a scalar loss $\mathcal{L}$ with respect to the parameters $\theta_l$ at each layer is recursively computed, with the chain rule realized as repeated Jacobian multiplications or vectorized tensor contractions.
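To make the notation concrete, the following minimal NumPy sketch (layer sizes and the quadratic loss are arbitrary choices for illustration) forms the layer Jacobians of a two-layer composition, multiplies them in order to obtain the full Jacobian, and checks the resulting gradient against finite differences.

```python
# A minimal NumPy sketch of the chain rule as a product of Jacobian matrices for a
# two-layer composition y = f2(f1(x)); sizes and the quadratic loss are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def forward(x):
    z1 = W1 @ x + b1                  # pre-activation of f1
    a1 = np.tanh(z1)                  # f1(x)
    y = W2 @ a1 + b2                  # f2(a1), linear output layer
    return z1, a1, y

x = rng.standard_normal(3)
z1, a1, y = forward(x)
loss = 0.5 * np.sum(y ** 2)           # scalar objective L = ||y||^2 / 2

# Layer Jacobians evaluated along the forward pass.
J1 = W1                               # dz1/dx
J_act = np.diag(1.0 - a1 ** 2)        # da1/dz1, diagonal for a coordinatewise nonlinearity
J2 = W2                               # dy/da1

# Chain rule: Dy/Dx is the ordered product of layer Jacobians.
J_full = J2 @ J_act @ J1              # shape (2, 3)
grad_x = J_full.T @ y                 # dL/dx = (Dy/Dx)^T dL/dy, with dL/dy = y

# Finite-difference check of dL/dx.
eps, fd = 1e-6, np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x); e[i] = eps
    fd[i] = (0.5 * np.sum(forward(x + e)[2] ** 2) - loss) / eps
print(np.allclose(grad_x, fd, atol=1e-4))   # expect True
```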
2. Standard Vectorized δ-Recursion for Feedforward Networks
The classic algorithmic structure of backpropagation involves the per-layer computation of "error signals" $\delta^{(l)} = \partial \mathcal{L}/\partial z^{(l)}$, the gradients of the loss with respect to the pre-activation variables $z^{(l)}$. For an output-based cost $C(a^{(L)})$ and nonlinearities $\sigma$ applied coordinatewise, one sets
$$\delta^{(L)} = \nabla_{a^{(L)}} C \odot \sigma'(z^{(L)})$$
and propagates backwards via
$$\delta^{(l)} = \big(W^{(l+1)\top}\delta^{(l+1)}\big) \odot \sigma'(z^{(l)}),$$
where "$\odot$" is the Hadamard product (Cheng, 2021; Damadi et al., 2023; Avrutskiy, 2017).
Parameter gradients are then given by matrix products:
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)}\, a^{(l-1)\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}.$$
These vectorized forms subsume older index-heavy presentations and are compatible with mini-batch matrix computation: when aggregating over the batch dimension, summation is over samples.
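The recursion above translates directly into array code. The following minimal NumPy sketch (squared-error cost, tanh nonlinearity, and layer sizes chosen purely for illustration) caches the forward pass, seeds $\delta^{(L)}$, and applies the Hadamard-product recursion and matrix-product gradient formulas over a mini-batch.

```python
# A minimal NumPy sketch of the vectorized delta-recursion for a feedforward
# network with a squared-error cost; layer sizes and data are illustrative.
import numpy as np

rng = np.random.default_rng(1)
sizes = [5, 8, 8, 2]                       # input, two hidden layers, output
Ws = [rng.standard_normal((m, n)) * 0.3 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]
sigma  = lambda z: np.tanh(z)
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def backprop(X, Y):
    """X: (n_in, batch), Y: (n_out, batch). Returns per-parameter gradients."""
    # Forward pass, caching pre-activations z and activations a.
    a, zs, acts = X, [], [X]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigma(z)
        zs.append(z); acts.append(a)
    # Output error signal: delta^L = (a^L - Y) ⊙ sigma'(z^L) for squared error.
    delta = (acts[-1] - Y) * dsigma(zs[-1])
    gWs, gbs = [None] * len(Ws), [None] * len(bs)
    for l in reversed(range(len(Ws))):
        gWs[l] = delta @ acts[l].T                 # dL/dW^l = delta a^(l-1)^T, summed over the batch
        gbs[l] = delta.sum(axis=1, keepdims=True)  # dL/db^l
        if l > 0:
            # delta^(l-1) = (W^l)^T delta^l ⊙ sigma'(z^(l-1))
            delta = (Ws[l].T @ delta) * dsigma(zs[l - 1])
    return gWs, gbs

X, Y = rng.standard_normal((5, 16)), rng.standard_normal((2, 16))
gWs, gbs = backprop(X, Y)
print([g.shape for g in gWs])   # [(8, 5), (8, 8), (2, 8)]
```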
3. Backpropagation in CNNs and Convolutional Layers
Convolutional architectures require adaptation of vectorized backpropagation to account for kernel weights, spatial structure, and the mechanics of padding and stride. In a standard CNN framework, input tensors $a^{(l-1)}$ are processed through layers where
$$z^{(l)} = W^{(l)} * a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma(z^{(l)}),$$
using kernel weights $W^{(l)}$ and biases $b^{(l)}$ (Boué, 2018). The vectorized backward pass leverages matrix- and tensor-level cross-correlation:
- The error signal $\delta^{(l)} = \partial\mathcal{L}/\partial z^{(l)}$ is backpropagated to the previous layer via transposed convolution (kernel flipping, padding correspondences, stride upsampling).
- Parameter gradients for kernel weights are computed via
$$\frac{\partial\mathcal{L}}{\partial W^{(l)}} = a^{(l-1)} \star \delta^{(l)},$$
where $\star$ denotes appropriate cross-correlation with respect to all batch and spatial dimensions.
Efficient implementations utilize im2col and GEMM transformations, with convolutions folded into matrix multiplications for both forward and backward paths.
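The following sketch illustrates the im2col-plus-GEMM pattern for a single image with stride 1 and no padding; the helper names (im2col, conv_forward, conv_backward_kernel) are ours, not a particular framework's API. The same patch matrix built for the forward GEMM is reused to obtain the kernel gradient as a second GEMM.

```python
# A minimal sketch (NumPy, stride 1, no padding, single image) of folding a
# convolution into a GEMM via im2col, and reusing the same patch matrix for the
# kernel gradient in the backward pass.
import numpy as np

def im2col(x, kh, kw):
    """x: (C, H, W) -> columns of all kh x kw patches, shape (C*kh*kw, H_out*W_out)."""
    C, H, W = x.shape
    H_out, W_out = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, H_out * W_out))
    idx = 0
    for i in range(H_out):
        for j in range(W_out):
            cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, (H_out, W_out)

def conv_forward(x, K):
    """K: (C_out, C_in, kh, kw). Returns output (C_out, H_out, W_out) and the patch matrix."""
    C_out, C_in, kh, kw = K.shape
    cols, (H_out, W_out) = im2col(x, kh, kw)
    z = (K.reshape(C_out, -1) @ cols).reshape(C_out, H_out, W_out)  # forward as a GEMM
    return z, cols

def conv_backward_kernel(dz, cols, K_shape):
    """Kernel gradient dL/dK from the upstream gradient dz, via the same GEMM."""
    C_out = K_shape[0]
    return (dz.reshape(C_out, -1) @ cols.T).reshape(K_shape)

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 6, 6))        # C_in = 3, 6x6 image
K = rng.standard_normal((4, 3, 3, 3))     # 4 output channels, 3x3 kernels
z, cols = conv_forward(x, K)
dz = np.ones_like(z)                      # pretend upstream gradient
dK = conv_backward_kernel(dz, cols, K.shape)
print(z.shape, dK.shape)                  # (4, 4, 4) (4, 3, 3, 3)
```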
4. Jacobian, Differential, and Operator-Theoretic Formulations
Recent vectorized derivations utilize the full machinery of Jacobian matrices and operator calculus. For each layer's transformation, the total derivative is constructed from block Jacobians:
- $\partial z^{(l)}/\partial a^{(l-1)} = W^{(l)}$,
- $\partial z^{(l)}/\partial \operatorname{vec}(W^{(l)}) = a^{(l-1)\top} \otimes I$ and $\partial z^{(l)}/\partial b^{(l)} = I$,
- $\partial a^{(l)}/\partial z^{(l)} = \operatorname{diag}\big(\sigma'(z^{(l)})\big)$, with final gradients expressed via sequential transpositions and multiplications of these Jacobians with the seed gradient at the output (Damadi et al., 2023).
Alternative formulations interpret the network as a block-triangular linear system, with reverse-mode differentiation implemented as a back-substitution process. The entire set of parameter gradients can be encapsulated by
$$\nabla_\theta \mathcal{L} = M^\top\big((I - L)^\top \backslash\, g\big),$$
where $L$ and $M$ collect the operator blocks for all layers, $g$ is the seed gradient at the output, and the backslash "$\backslash$" denotes a triangular solve (Edelman et al., 2023). This unifies recursive error-signal calculation, parameter-gradient updates, and adjoint-operator application under broad functional-analytic notation.
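A dense toy instantiation of this operator view is sketched below (NumPy, two tanh layers): the state-to-state Jacobian blocks are collected into a strictly lower block-triangular $L$, the parameter blocks into $M$, the adjoint variables are obtained by solving the block upper-triangular system $(I-L)^\top \lambda = g$, and the result is checked against the standard $\delta$-recursion. Assembling $L$ and $M$ densely is purely for illustration; practical implementations apply these blocks as operators.

```python
# A minimal sketch (NumPy, 2-layer tanh network) of the block-triangular view:
# reverse mode as a triangular solve (I - L)^T lambda = g followed by grad = M^T lambda.
import numpy as np

rng = np.random.default_rng(3)
n0, n1, n2 = 3, 4, 2
W1, W2 = rng.standard_normal((n1, n0)), rng.standard_normal((n2, n1))
x0 = rng.standard_normal(n0)

# Forward pass.
z1 = W1 @ x0; x1 = np.tanh(z1)
z2 = W2 @ x1; x2 = np.tanh(z2)
g2 = x2.copy()                                        # seed: dL/dx2 for L = ||x2||^2 / 2

# Layer Jacobians evaluated on the forward pass.
D1, D2 = np.diag(1 - x1**2), np.diag(1 - x2**2)       # diag(sigma'(z_l))
A2 = D2 @ W2                                          # dx2/dx1
B1 = D1 @ np.kron(np.eye(n1), x0[None, :])            # dx1/d vec(W1), row-major (C-order) vec
B2 = D2 @ np.kron(np.eye(n2), x1[None, :])            # dx2/d vec(W2)

# Block system over the states s = [x1; x2]; L is strictly lower block-triangular.
L = np.block([[np.zeros((n1, n1)), np.zeros((n1, n2))],
              [A2,                 np.zeros((n2, n2))]])
M = np.block([[B1,                       np.zeros((n1, n2 * n1))],
              [np.zeros((n2, n1 * n0)),  B2]])
g = np.concatenate([np.zeros(n1), g2])

# (I - L)^T is block upper triangular, so this solve is a back-substitution.
lam = np.linalg.solve((np.eye(n1 + n2) - L).T, g)
grad_params = M.T @ lam
gW1 = grad_params[:n1 * n0].reshape(n1, n0)
gW2 = grad_params[n1 * n0:].reshape(n2, n1)

# Cross-check against the standard delta-recursion.
d2 = g2 * (1 - x2**2)
d1 = (W2.T @ d2) * (1 - x1**2)
print(np.allclose(gW1, np.outer(d1, x0)), np.allclose(gW2, np.outer(d2, x1)))  # True True
```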
5. Extensions: Higher Order, Output Derivatives, and Specialized Architectures
Further generalizations encompass:
- Training with respect to derivatives of network outputs, enabling direct solution of PDEs, using extended recursions and multi-indexed differential operators (Avrutskiy, 2017). For input $x$, derivatives with respect to input coordinates are propagated alongside the forward and backward passes by differentiating the layer recursion; at first order,
$$\frac{\partial z^{(l)}}{\partial x_i} = W^{(l)}\,\frac{\partial a^{(l-1)}}{\partial x_i}, \qquad \frac{\partial a^{(l)}}{\partial x_i} = \sigma'(z^{(l)}) \odot \frac{\partial z^{(l)}}{\partial x_i},$$
with higher orders obtained by repeated differentiation and with parameter gradients built from these derivatives.
- Quadratic neurons, whose pre-activations involve quadratic forms such as $x^\top A x + b^\top x + c$, require specialized gradient formulas via the chain rule for quadratic forms. For a matrix $A$, the gradient is
$$\nabla_x\big(x^\top A x\big) = (A + A^\top)\,x,$$
which equals $2Ax$ if $A = A^\top$, and $(A + A^\top)x$ otherwise (Noel et al., 2023); a numerical check of this formula appears after this list.
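The quadratic-form gradient in the last item is easy to verify numerically; the sketch below (NumPy, random $A$ and $x$) compares $(A+A^\top)x$ with central finite differences and confirms the reduction to $2Ax$ in the symmetric case.

```python
# A quick numerical check (NumPy) of the quadratic-form gradient used for quadratic
# neurons: grad_x (x^T A x) = (A + A^T) x, reducing to 2 A x when A is symmetric.
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n))           # general (non-symmetric) matrix
x = rng.standard_normal(n)

analytic = (A + A.T) @ x                  # chain-rule gradient of x^T A x

# Central finite differences of f(x) = x^T A x.
eps, numeric = 1e-6, np.zeros(n)
for i in range(n):
    e = np.zeros(n); e[i] = eps
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))            # True
A_sym = 0.5 * (A + A.T)
print(np.allclose((A_sym + A_sym.T) @ x, 2 * A_sym @ x))    # True: symmetric case
```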
6. Application to Transformer Architectures and Modern Blocks
In transformer models, the vectorized derivation is applied to embeddings, multi-headed self-attention, layer normalization, and parameter-efficient fine-tuning (LoRA). The notation remains index-free:
- For token embedding $X = T E$, where $T$ one-hot encodes the token indices and $E$ is the embedding matrix, the gradient is $\partial\mathcal{L}/\partial E = T^\top\,\partial\mathcal{L}/\partial X$, i.e. a scatter-add of the output gradients onto the rows selected by each token.
- For self-attention: backward-gradient propagation splits into heads, passes through softmax, scaling, and projections, finally summing per-head contributions onto input and parameter gradients.
- Layer normalization and LoRA gradients are similarly explicated, with all tensor manipulations specified in matrix calculus (Boué, 2025).
PyTorch implementations of these update rules are direct, since the autograd engine applies the same vectorized chain rule and batched matrix multiplications.
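As a small concreteness check on the embedding case, the following PyTorch sketch (toy vocabulary and dimensions) compares the manual scatter-add gradient $T^\top\,\partial\mathcal{L}/\partial X$ with what autograd accumulates into the embedding weight.

```python
# A minimal PyTorch sketch checking the matrix-calculus embedding gradient
# (a scatter-add of output gradients onto the selected rows) against autograd.
import torch

torch.manual_seed(0)
vocab, dim = 10, 4
emb = torch.nn.Embedding(vocab, dim)
ids = torch.tensor([1, 3, 3, 7])                 # token indices (with a repeat)
G = torch.randn(len(ids), dim)                   # pretend upstream gradient dL/dX

X = emb(ids)                                     # X = rows of E selected by ids
loss = (X * G).sum()                             # makes dL/dX equal to G exactly
loss.backward()

manual = torch.zeros(vocab, dim)
manual.index_add_(0, ids, G)                     # dL/dE: accumulate G into the selected rows
print(torch.allclose(manual, emb.weight.grad))   # True
```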
7. Practical and Computational Implications
The vectorized approach to backpropagation confers several practical advantages:
- Eliminates index-level bookkeeping, enabling rapid parallelization (BLAS/GEMM) and compatibility with auto-differentiation frameworks.
- Facilitates the handling of large minibatches, structured layers, and novel neuron types.
- Provides immediate generalization to higher-order derivatives and operator-theoretic extensions.
All formulas and architectures described can be instantiated in array-based frameworks, with gradients expressed as matrix multiplications and elementwise operations, ready for efficient computation (Boué, 2018; Cheng, 2021; Damadi et al., 2023; Edelman et al., 2023; Noel et al., 2023; Boué, 2025; Chacón et al., 2020; Avrutskiy, 2017). The applicability and extensibility of vectorized backpropagation continue to impact deep learning methodology, analysis, and specialized model development.